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Title: Method of compilation 

5 The present invention relates to the subject matter claimed and hence refers to a 
method and a device for compiling programs for a reconfigurable device. 

Reconfigurable devices are well-known. They include systolic arrays, neuronal 
networks, Multiprocessor systems, Prozessoren comprising a plurality of ALU 

10 and/or logic cells, crossbar-switches, as well as FPGAs, DPGAs, XPUTERs, asf.. 
Reference is being made to DE44 16 881 A1, DE 197 81 412 A1, DE 197 81 483 
A1. DE 196 54 846 A1. DE 196 54 593 A1. DE 197 04 044.6 A1, DE 198 80 129 
A1, DE 198 61 088 A1, DE 199 80 312 A1, PCT/DE 00/01869, DE 100 36 627 
A1, DE 100 28 397 A1, DE 101 10 530 A1. DE 101 11 014 A1. PCT/EP 

15 00/10516, EP 01 102 674A1,DE198 80 128A1,DE101 39 170A1.DE198 09 
640 A1. DE 199 26 538.0 A1 , DE 100 050 442 A1 the full disclosure of which is 
incorporated herein for purposes of reference. 

Furthermore, reference is being made to devices and nriethods as known from US 
20 PS 6,31 1 ,200; US PS 6,298,472; US PS 6,288,566; US PS 6,282,627; US PS 
6,243,808 issued to Chameleonsystems INC, USA noting that the disclosure of 
the present application is pertinent in at least some aspects to some of the de- 
vices disclosed therein. 

25 The invention will now be described by the following papers which are part of the 
present application. 
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1. Introduction 

This document describes the PACT Vectorising C Compiler XPP-VC which maps 
a C subset extended by port access functions to PACTs Native Mapping Lan- 
5 guage NML. A future extension of this compiler for a host-XPP hybrid system Is 
described in Section 7.3. 

XPP-VC uses the public domain SUIF compiler system. For installation Instruc- 
tions on both SUIF and XPP-VC, refer to the separately available installation 
10 notes. 

2. General Approach 

The XPP-VC Implementation is based on the public domain SUIF compiler 
15 framewori< (cf . http : //suif . Stanford . edu). SUIF was chosen because it 
is easily extensible. . 

SUIF was extended with two passes: partition and nialgen. The first pass, 
partition, tests if the program complies with the restrictions of the compiler (cf. 

20 Section 3.1) and perfomris a dependence analysis. It determines if a FOR-loop 
can be vectorized and annotates the syntax tree accordingly. In XPP-VC, vectori- 
zatlon means that loop iterations are overiapped and executed in a pipelined, 
parallel fashion. This technique is based on the Pipeline Vectorization method 
developed for reconfigurable architectures^ partition also completely unrolls 

25 inner program FOR-loops which are annotated by the user. All innemnost loops 
(after unrolling) which can be vectorized are selected and annotated for pipeline 
synthesis. 

nmigen generates a control/dataflow graph for the program as follows. First, pro- 
30 gram data is allocated on the XPP Core. By default, nmigen maps each program 
an^y to intemal RAM blocks while scalar variables are stored in registers within 

' Cf. M. Weinhardt and W. Luk: Pipeline Vectorization, IEEE Transactions on Computer-Alded 
Design of Integrated Circuits and Systems, Feb. 2001 . pp. 234-248. 
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the PAEs. If instructed by a pragma directive (cf. Section 3.2.2), arrays are 
mapped to external RAM. If it is large enough, an external RAM can hold several 
arrays. 

5 Next, one ALU is allocated for each operator in the program (after loop unrolling, 
if applicable). The ALUs are connected according to the data-flow of the program. 
This data-driven execution of the operators automatically yields some instruction- 
level parallelism within a basic block of the program, but the basic blocks are 
nonnally executed in their original, sequential order, controlled by event signals. 

10 However, for generating more efficient XPP Core configurations, nmlgen gener- 
ates pipelined operator networks for inner program loops which have been anno- 
tated for vectorization by partition. In other words, subsequent loop iterations 
are stated before previous iterations have finished; Data packets flow continu- 
ously through the operator pipelines. By applying pipeline balancing techniques, 

15 maximum throughput is achieved. For many programs, additional performance 
gains are achieved by the complete loop unrolling transfomiatlon. Though un- 
rolled loops require more XPP resources because individual PAEs are allocated 
for each loop iteration, they yield more parallelism and better exploitation of the 
XPP Core. 

20 

Finally, nmlgen outputs a self-contained NML file containing a module which im- 
plements the program on an XPP Core. The XPP IP parameters for the generated 
NML file are read from a configuration file. cf. Section 4. Thus the parameters can 
be easily changed. Obviously, large programs may produce NML files which can- 
25 not be placed and routed on a given XPP Core. Later XPP-VC releases will per- 
form a temporal partitioning of C programs In order to overcome this limitation, cf. 
Section 7.1. 

3. Language Coverage 

30 

This Section describes which C files can currently be handled by XPP-VC. 
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3.1 Restrictions 
3.1.1 XPP Restrictions 

The following C language operations cannot be mapped to an XPP Core at all. 
They are not allowed in XPP-VC programs and need to be mapped to the host 
processor in a codesign compiler; cf. Section 7.3, 

10 • Operating System calls, including I/O 

• Division, modulo, non-constant shift and floating point operations (unless XPP 
Core's ALU supports them)^ 

15 • The size of an^ys mapped to internal RAI^s is limited by the number and size 
of intemal RAM blocks. 

3.1 .2XPP-VC Compiler Restrictions 

20 The current XPP-VC Implementation necessitates the following restrictions: 

1 . No multi-dimensional constant an-ays (due to the SUIF version cun-ently used) 

2. No switch/case statements 

3. No s t r u c t data types 

4. No function calls except the XPP port and pragma functions defined in Sec- 
tion 3.2.1 . The program must only have one function (main). 

5. No pointer operations 

6. No library calls or recursive calls 



25 



30 



^ In future XPP-VC releases, an alternative, sequential implementation of these operations by 
NML macros will be available. 
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7. No irregular control flow (break, continue, goto, label) 

Additionally, there are cunrently some innplementation-dependent restrictions for 
vectorized loops, cf. the Release Notes. The compiler produces an explanatory 
message if an inner loop cannot be pipelined despite the absence of dependen- 
cies. However, for many of these cases, simple workarounds by minor program 
changes are available. Furthemiore, programs which are too large for one con- 
figuration cannot be handled. They should be split into several configurations and 
sequenced onto the XPP Core, using NML's reconfiguration commands. This will 
be performed automatically in later releases by temporal partitioning, cf. Section 
7.1. 

3.2 XPP-VC C Language Extensions 

We now describe useful C language extensions used by XPP-VC. In order to use 
these extensions, the C program must contain the following line: 

#include "XPP.h" 

This header file. XPP.h, defines the port functions defined below as well as the 
pragma function XPP_unroll ( ) . If XPP_unroll ( ) directly precedes a FOR 
loop, it will be completely unrolled by partition, cf. Section 6.2. 

3.2.1 XPP Port Functions 

Since the nomial C I/O functions cannot be used on an XPP Core, a method to 
access the XPP I/O units in port mode is provided. XPPM contains the definition 
of the following two functions: 

XPP_getstreain(int ionum, int portnum, int *value) 
XPP _j)ut stream (int ionum, int portnum, int value) 

5 

SUBSTITUTE SHEET (RULE 26) 



wo 03/071418 



PCT/EP03/00624 



ionum refers to an I/O unit (1 ..4), and portnum to the port used in this I/O unit (0 
or 1). For the duration of the execution of a program, an I/O unit may only be used 
either for port accesses or for RAM accesses (see below). If an I/O unit is used in 
port mode, each portnum can only be used either for read or for write accesses 
5 during the entire program execution. In the accisss functions, value is the data 
received from or written to the stream. Note that XPP getstream can currently 
only read values into scalar variables (not directly into array elements!), whereas 
XPP putstream can handle any expressions. An example program using these 
functions is presented in Section 6.1. 

10 

3.2.2 pragma Directives 

Arrays can be allocated to external memory by a compiler directive: 

15 ipragma extern <var> <RAM_niamber> 

Example: #pragma extern x 1 maps array x to external memory bank 1 . 
Note the following: 

20 

• <var> must be defined before A is used in the pragma. 

• Bank <RAM_number> must be declared in the file xppvcjopUons, cf. Section 4. 

25 • If two arrays are allocated to the same external RAM bank, they are arranged 
in the order of appearance of their respective pragma directives. The resulting 
offsets are recorded in fifte.//f, cf. Section 5.1 
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4. Directories and Files 

After correct Installation, the XPPC_ROOT environment variable is defined, and tlie 
PATH variable extended, $XPPC_ROOT is the XPP-VC root directory. 
5 $XPPC_ROOT/bin contains all binary files and the scripts xppvcmake and xppgcc. 
$XPPC_ROOT/doc contains this manual and the file xppvc_releasenotes.txt XPP.h 
is located in the include subdirectory. 

Finally, $XPPC_ROOT/lib contains the options file xppvc joptions. If an options file 
10 with the same name exist in the current working directory or the .xcfs subdirectory of 
the user's home directory, they are used (in this order) instead of the master file in 
$XPPC_ROOT/iib. 



Option 


Explanation 


Default value in 

Xppvc_options 


debug 


debug output enabled 


on 


version 


XPP IP version 


V2 


pacsize 


number of ALU-PAEs in x and y direction 


6/12 


xppsize 


number of PACs in x and y direction 


1/1 


busnumber 


number of data and event buses per row (both dir.s) 


6/6 


iramsize 


number of words in one intemal RAM 


256 


bitwidth 


XPP data bid width 


32 


freg_data_port 


number of FREG data ports 


3 


breg_dataj)orl 


number of BREG data ports 


3 


freg_event_port 


number of FREG event ports 


4 


breg_eventj)ort 


number of BREG event ports 


4 



15 

Table 1 : Options 

xppvcjoptions sets the compiler options listed in Table 1 . Most of them define the 
XPP IP parameters which are used in the generated NML file. Lines starting with a # 
20 character are comment lines. 
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Additionally, extram followed by four integers declares the external RAM banks 
used for storing arrays. At most four external RAMs can be used. Each integer 
represents the size of the bank declared. Size zero must be used for banks which 
do not exist. The master file contains the following line which declares four 4GB 
5 (1 G words) external banks: 

extram 1073741824 1073741824 1073741824 1073741824 

Note that, in order to simplify programming, xppvc_options does not have to 
10 be changed if an I/O unit is used for port accesses. However, this memory bank is 
not available in this case despite being declared. 

5. Using XPP-VC 

15 5,1 xppvcmake 

In order to create an NML file, file.c is compiled with the command xppvcmake 
file . nml. xppvcmake file . xbin additionally calls xmap. With xppvcmake, 
XPRh is automatically searched for in directory $XPPC_ROOT/include. 

20 

The following output produced by translating the example program streamfir.c In 
Section 6.1 shows the programs called by xppvcmake: 

$ xppvcmake streamf ir . nml 
25 pscc -I/home/wema/xppc/include -parallel -no P0RKY_F0RWARD_PRDP4 
-.spr streamfir.c 
porky -dead-code streamf ir. spr streamf ir. spr2 
partition streamf ir . spr2 streamf ir. svo 
Program analysis: 
30 main: DO-LOOP^ line 9 can be synthesized 
main: can be synthesized completely 
Program partitioning: 

Entire program selected for XPU module synthesis. 
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main: DO-LOOP, line 9 selected for synthesis 
porky -const-prop -scalarise -copy-prop -dead-code streamf ir . svo 

streamf ir -. svol 
predep -normalize streamf ir. svol streamf ir . svo2 
5 porky -ivar -know-bounds -fold streamf ir. svo2 streamf ir. sur 
nmlgen streamf ir. sur streamfir.xco 

pscc is the SUIF frontend which translates steamHr.c into the SUIF intermediate 
representation, and porky performs some standard optimizations. Next, parti- 

10 tion analyses the program. The output indicates that the entire program can and 
will be mapped to NML Then porky and predep perform some additional optimi- 
zations before nmlgen actually generates the file streamfinnml. The SUIF file 
streamUnxco is generated to Inspect and debug the result of code transformations.^ 
In the generated NML file, only the I/O ports are placed. All other objects are placed 

15 automatically by xmap. Cf. Section 6.1 for an example of the xsim program using 
the I/O porte corresponding to the stream functions used in the program. 

For an input file file.c, nmlgen also creates an interface description file file.itf 'm the 
working directory. It shows the array to RAM mapping chosen by the compiler. In the 

20 debug subdirectory (which is created), files file.partjdbg and file.nmlgenjdlbg are 
generated- They contain more detailed debugging information created by parti- 
tion and nmlgen respectively. The files file^firstdot and We^HnaLdot created in the 
debug directory can be viewed with the dotty graph layout tool. They contain 
graphical representations of the original and the transformed and optimized version 

25 of the generated control/dataflow graph. 



^ In an extended codesign compiler, the .xco file would also be used to generate the host partition of 
the program. 

9 
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5-2 X p p g c c 

This command is provided for comparing simulation results obtained with 
5 xppvcmake, xmap and xsim (or from execution on actual XPP hardware) with a 
"direct" compilation of the C program with gcc on the host, xppgcc compiles the 
input program with gcc and binds it with predefined XPP_getstreain and 
XPP^putstream functions. They read or write files port<n>_<m>.dat in the cur- 
rent directory for n in 1..4 and m in 0..1. For instance, the program in Section 6.1 
10 is compiled as follows: 

xppgcc -o' streamfir streamfir.c 

The resulting program streamUr wWl read input data from port1j0.dat and write its 
1 5 results to port4j0, daf, 

6. Examples 

6.1 Stream Access 

20 

The following program streamfir.c is a small example showing the usage of the 
XPP_get stream and XPP_put stream functions. The infinite WHILE-loop im- 
plements a small FIR filter which reads input values from port I J) and writes out- 
put values to port 4_0. The variables xd, xdd and xddd are used to store delayed 
25 input values. The compiler automatically generates a shift-register-like configura- 
tion for these variables. Since no operator dependencies exist in the loop, the 
loop iterations overlap automatically, leading to a pipelined FIR filter execution. 



^ However, programs receiving Initial data from or writing result data to external RAMs in xsim 
cannot be compared to directly compiled programs using xppgcc. The results may also differ if a 
bitwidth other than 32 is used for the generated NML files. 
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1 # include '"XPP.h" 
2 

3 mainO { 
5 4 int X, xd, xdd, xddd; 
5 

6 X = 0; 

7 xd = 0; 

8 xdd = 0; 

10 9 while (1) { 

10 xddd = xdd; 

11 xdd = xd; 

12 xd = x; 

13' XPP_getstreain(l, 0, &x) ; 
15 14 XPP:j)utstream(4, 0, (2*x + 6*xd + 6*xdd + 2*xddd) » 4); 

15 } 

16 } 



20 After generating streamfinxbin with tlie command xppvcmake streamf ir . xbin, 
tlie following command reads tiie input file port1__0.dat and writes the simulation re- 
sults to xpp _port4_0.dat 

xsim -run 2000 -inl_0 portl_0.dat -out4_0 xpp_port4_0.dat 
25 streamf ir .xbin > /dev/null 

xpp j3Ort4_0.dat can now be compared with port4_0.dat generated by compiling the 
program with xppgcc and running it with the same port1j0.dat. 

30 6.2 Array Access 

The following program arrayfir.c is an FIR filter operating on arrays. The first FOR- 
loop reads Input data from port 1_0 into array x, the second loop filters x and writes 
the filtered data into array y, and the third loop outputs y on port 4_0. 
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1 #include ^^XPP.h^' 

2 #define N 256 

3 int x[N], y[N]; 

5 4 const int c[4] = { 2, 4, 4, 2 }; 

5 main ( ) { 

6 int i, j, tmp; 

7 for (i = 0; i < N; i++) { 

8 XPP_getstream(l, 0, &tmp) ; 
10 9 x[i] = tmp; 

10 } 

11 for (i = 0; i < N-3; i++) { 

12 tmp = 0; 

13 XPP_unroll ( ) ; 

15 14 for (j = 0; j < 4; j++) { 

15 tmp += c[j]*x[i+3-j] ; 

16 } 

17 y[i+2] = tmp; 

18 } 

20 19 for (i = 0; i < N-3; i++) 

20 XPP_j5ut stream (4, 0, y[i+2]); 

21 } 



25 



xppvcmake produces the following output 



$ xppvcmake arrayfir.nml 

pscc -I/home/wema/xppc/include -parallel -no PORKY_FORWARD_PROP4 

-.spr arrayfir.c 
porky -dead-code arrayfir.spr arrayf ir-spr2 
30 partition arrayf ir.spr2 arrayf ir.svo 
Program analysis: 

main: FOR-LOOP i, line 7 can be synthesized/vectorized 
main: FOR-LOOP j, line 14 can be synthesized/unrolled/vectorized 
main: FOR-LOOP i, line 11 can be synthesized/vectorized 
35 main: FOR-LOOP i, line 19 can be synthesized/vectorized 
main: can be synthesized completely 
Program partitioning: 

Entire program selected for NML module synthesis . 
main: FOR-LOOP i^ line 7 selected for pipeline synthesis 
40 main: FOR-LOOP i, line 11 selected for pipeline synthesis 
main: FOR-LOOP i, line 19 selected for pipeline synthesis 
. . .unrolling loop j 
porky -const-prop -scalarise -copy-prop -dead-code arrayf ir.svo 
arrayf ir. svol 
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predep -normalize arrayfir . svol arrayf ir - svo2 

porky -ivar -know-boiinds -fold arrayf ir . svo2 arrayf ir.sur 

nmlgen arrayfir. sur arrayf ir.xco 

5 

The messages from partition show that all loops can be vectorized. The de- 
pendence analysis did not find any loop-earned dependencies preventing vectori- 
zation. The inner loop in the middle of the program is unrolled. The outer loop's 
body is effectively substituted by the following statement: 

10 

y[i+2] = c[0]*x[i+3] + c[l]*x[i+2] + c[2]*x[i+l] + c[3]*x[i]; 

Since all remaining loops are innermost loops, they are selected for pipeline syn- 
thesis. Array reads, computations, and array writes overtap. To reduce the num- 
1 5 ber of an^y accesses, the compiler automatically removes redundant array reads. 
In the middle loop, only x[i+3] is read. Forx[i+2] , x[i+13 and x[il, de- 
layed versions of x [i+3] are used, fonning a shift-register. Therefore, each loop 
iteration needs only one cycle since one read from x, all computations, and one 
write to y can be executed concurrently. 

20 

Finally, the following example program fragment is a 2-D edge detection algo- 
rithm. 

/* 3x3 horiz. + vert, edge detection in both directions */ 
25 for(v=0; v<=yERLEN-3; v++) { 

for{h=0; h<=H0RLEN^3; h++) { . 

htmp = (pl[v+2][h] -.pl[v][h]) + 

(pl[v+2] [h+2] - pl[v] [h+2]) + 
2 * (PI [v+2] [h+1] - pl[v][h+l]); 
30 if (htmp < 0) 

htmp = - htmp; 

vtmp = (pi [V] [h+2] - pl[v][h]) + 

(pi [v+2] (h+2] -pl[v+2][h]) + 
35 2 * (pi [v+1] [h+2] - pl[v+l] [h]); 

if (vtmp < 0) 
vtmp = - vtmp; 
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sum = htmp + vtiap; 
if (sum > 255) . 

sum = 255; 
p2iv+l][h+l] = sum; 

5 } 

} 

As the output of partition shows, both loops can be vectorized. Since only 
innermost loops can be pipelined, the outer loop is executed sequentially. (Note 
10 that the line numbers in the program outputs are not obvious since only a pro- 
gram fragment is shown above.) 

partition edge.spr2 edge.svo 
Program analysis: 
15 main: FOR-LOOP h, line 22 can be synthesized/can be vectorized 

main: FOR-LOOP line 21 can be synthesized/can be vectorized 
main: can be synthesized completely 
Program partitioning: 

Entire program selected for XPP module synthesis. 
20 main: FOR-LOOP h, line 22 selected for pipeline synthesis 

main: FOR-LOOP v, line 21 selected for synthesis 

Also note the following additional features of this program: Address generators for 
the 2-D array accesses are automatically generated, and the array accesses are 
25 reduced by generating shift-registers for each of the three image lines accessed. 
Furthennore, the conditional statements are implemented using SWAP (MUX) 
operators. Thus the streaming of the pipeline is not affected by which branch the 
conditional statements take. 

30 7. Future Compiler Extensions 

Apart from removing some of the restrictions of Section 3.1.2, the following exten- 
sions are planned for XPP-VC. 
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7.1 Temporal Partitioning 

By using the pragma function XPP_next.confO, programs are partitioned into sev- 
5 era! configurations wliich are loaded and executed sequentially on the XPP Core. 
Specific NML configuration commands are generated which also exploit XPP's 
sophisticated configuration and preloading capabilities. Eventually, the temporal 
partitions will be determined automatically. 

10 7.2 Program Transformations 

For more efficient XPP configuration generation, some program transformations 
are useful. In addition to loop unrolling, loop merging, loop distribution and loop 
tiling will be used to improve loop handling, i. e. enable more parallelism or better 
15 XPP usage. 

Furthemiore, programs containing more than one function could be handled by 
inlining function calls. 

20 7.3 Codesign Compiler 

This section sketches what an extended C compiler for an architecture consisting 
of an XPP Core combined with a host processor might look like. The compiler 
should map suitable program parts, especially inner loops, to the XPP Core, and 
25 the rest of the program to the host processor. I. e., it is a host/XPP codesign 
compiler, and the XPP Core acts as a coprocessor to the host processor. 

This compiler's Input language is full standard ANSI C. The user uses pragmas to 
annotate those program parte tfiat should be executed by the XPP Core (manual 
30 partitioning). The compiler checks if the selected parts can be implemented on the 
XPP. Program parte containing non-mappable operations must be executed by 
the host. 
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The program parts running on the host processor ("SW"), and the parts running 
on the PAE array ("XPP") cooperate using predefined routines 
{copyjdataJto^XPP, copy^datajtojiost, start_config(n), 
waitjFor_copmcessorJinish(n), request_config(n)). For all XPP program parts, 
5 XPP configurations are generated. In the program code, the XPP part n is re- 
placed by request_config(n), start_config(n), waitJtor_copmcessor_pnish(n), and 
the necessary data movements. Since the SUIF compiler contains a C backend, 
the altered program (host parts with coprocessor calls) can simply be written back 
to a C file and then processed by the native C compiler of the host processor. 

10 

Thus the sequential control flow of the C program defines when XPP parts are 
configured into the XPP Core and executed. 
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10 Abstract 

The extreme Processing Platform (XPP) technology offers a unique reconfigu- 
rable computing platform supported with a set of tools. A C compiler, which inte- 
grates both new and efficienVcompilation techniques and temporal partitioning, is 

1 5 presented. Temporal partitioning guarantees the compilation programs with un- 
limited complexity as long as the supported Osubset is used. A new partitioning 
scheme, which pennits to map large loops of any kind and, is neither constrained 
by loop-dependencies nor nested stmctures, is also presented. Furthermore, 
temporal partitioning is applied to reduce the configurafion time overhead and 

20 thus can lead to performance gains. The compilation from C code to the configu- 
ration data, ready to be downloaded onto the XPP, takes seconds for complex 
examples, which is, as fares we know, not reproduced by any other reconfigu- 
rable computing technology. The compiler represents a step forward, by furnish- 
ing a truly ''push-button" approach only comparable to microprocessor domains, 

25 and Uius can be spread the use of the XPP technology and deal with time-to- 
market pressures positively. 

1. Introduction 

30 Many of today's applications are characterized by intensive data-stream process- 
ing and high-performance requirements. Such performance is more and more 
evident to not be accomplished with today's microprocessor technology. Conven- 
tional processors (including DSPs) are geared for sequencial processing. Multl- 
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DSP and very large instruction word (VLIW) processors still have severe memory 
bottlenecks, lack the number of data ports required to support multi-channel, high 
speed data streams, and fail on furnishing low power consumption solutions. Ac- 
celerating specific functions using application-specific integrated circuits (ASICs) 

5 relieves some of the processing burden, adds some required features, but limits 
flexibility and requires expensive non-recurring engineering (NRE) costs and long 
design cycles. High density field-programmable gate arrays (FP-GAs) eliminate 
the NRE costs, add flexibility, but still require long timing optimizations and verifi- 
cation cycles and low level hardware efforts. Additionally, the fine-grained struc- 

10 ture adopted in FPGAs is not suitable to map at the algorithmic level, which is 
proved by the well-known difficulties to have a "push-button" high-level methodol- 
ogy to program these architectures. 

New reconfigurable processing units (RPUs) are being introduced trying to solve 
15 those problems [1]. One of the new promising architectures is the XPP [21[3]. The 
XPP is a coarse-grained, runtime-reconfigurable, 2-D array parallel structure. The 
architecture was designed to facilitate programming and to support pipelining, 
dataflow computations, and parallelism from the instruction to the task level effi- 
ciently. Therefore, this technology is well suited for applications in multimedia, 
20 telecommunications, simulation, digital signal processing, and similar stream- 
based application domains. The XPP architecture also supports dynamic self- 
reconfiguration in a user transparent way. In order to drastically reduce the time to 
program the XPP, and to keep the user from architecture details, a high-level 
compiler integrating temporal partitioning is required. Such a compiler is the main 
25 topic of this paper. 

This paper is organized as follows. The next section introduces briefly the XPP 
technology. Section 3 outlines compilation to the XPP and section 4 describes the 
temporal partitioning steps. Section 5 shows some experimental results, section 6 
30 points out the main differences between this and previous works, and finally sec- 
tion 7 concludes the paper and enumerates ongoing and future work planned. 
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2. XPP Technology 

The XPP technology consists of a reconfigurable computing platform delivered as 
5 a device or an intellectual property (IP) core, and a complete development tool 
suite (XDS) [2]. An XPP can be used as a coprocessor for CPU and DSP archi- 
tectures. A prior version of the technology has resulted in the XPU128-ES [3], a 
prototype device, which was produced in silicon. 

10 The XPP architecture is based on a hierarchical an-ay of coarse-grain, adaptive 
computing elements, called Processing Array Elements (PAEs), and a packet- 
oriented communication network. The strength of the XPP technology originates 
from the combination of array processing with unique and powerful run-time re- 
configuration mechanisms. Different tasks or applications can be configured and 

15 run independently on different parts of the an^y. Reconfiguration is triggered ex- 
temally or even by special event signals originating within the anay, enabling self- 
reconfiguring designs. By utilizing protocols implemented in hardware, data and 
event packets are used to process, generate, decompose and merge streams of 
data. 

20 

2.1 Array Structure 

An XPP contains one or several Processing An^y Clusters (PACs), i. e., rectan- 
gular blocks of PAEs. Fig. 1 shows the structure of a typical XPP device. It con- 

25 tains four PACs (see top left-hand side). Each PAC is attached to a Configuration 
Manager (CM) responsible for writing configuration data into the configurable ob- 
jects of the PAC using a dedicated bus. Multi-PAC XPPs contain additional CMs 
for configuration data handling, forming a hierarchical tree of CMs. The root CM is 
called ttie supervising CM (SCM). It has an extemal interface (dotted arrow origi- 

30 nating from the SCM in Fig. 1) which usually connects the SCM to an external 
configuration memory. A CM consists of a state machine and internal RAM for 
configuration caching (see.top right-hand side of Fig.1). 

19 
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Horizontal busses carry data and events. They can be segmented by configurable 
switch-objects, and connected to PAEs and special I/O objects at the periphery of 
the device. The I/O objects can be used for data-streaming or to access extemal 
resources (e.g., memories). A column of ports to the corresponding leaf CM is 
5 located on the array. A CMPort can be used to send events to the CM from the 
array. The typical PAE shown in Fig. 1 (bottom center) contains three objects: one 
FREG (forward register), one BREG (backward register) and one ALU. The 
FREG object is used for vertical fonA^ard routing (with a programmable number of 
register stages), or to perform MERGlE, SWAP or DEMUX operations (for con- 

10 trolled stream manipulations). The BREG object is used for vertical backward 
routing (registered or not), or to perform some selected arithmetic operations 
(e.g., ADD, SUB, SHIFT). The BREGs can also be used to perform logical opera- 
tions on events. Each ALU (see its internal structure on the bottom left-hand side 
of Fig.1) performs common two-input fixed-point arithmetical and logical opera- 

15 tions, and comparisons. A MAC (multiply and accumulate) operation can be per- 
formed using the ALU and the BREG objects of one PAE in a single clock cycle. 

Another standard PAE object Is the memory object which can be used in FIFO 
mode or as RAM for lookup tables, intermediate results, etc. If such objects are 
20 needed they are located in the left and/or right columns of PAEs of each PAC. 
However, any PAE object functionality can be included in the XPP architecture. 

A set of parameterizable features can be used to furnish an XPP that best fits to 
user and application demands. Those features include: the number of PACs and 
25 their PAEs, number of internal memories, number of I/O pori:s, number of buses, 
word bitwidth, cache size, depth of the FIFO to configure each object, etc. 
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Figure 1 : XPP architecture. 
2.2 Packet Handling and Synchronization 

PAE objects as defined above communicate via a pacl^et oriented networl<. Two 
types of pacl<ets are sent through the anray: data and event packets. Data pacl^- 
ets have a unifonn bitwidth specific to the XPP Core or device. 



In normal operation mode, PAE objects are self-synchronizing. An operation Is 
25 perfonmed as soon as all necessary data input packets are available. The results 
are fonA^arded as soon as they are computed and the previous results have been 
consumed. Thus, a signal-flow graph can be mapped directly to the ALU objecte 
and data-streams can flow through them In a pipelined manner without adding 
specific hardware. 

30 

Event packets are one bit wide. The transnriit state infomiation which controls ALU 
execution and packet generation. For instance, they can be used to control the 
merging of data-streams or to deliberately discard data packets. Thus, conditional. 
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computations depending on the results of earlier ALU operations are feasible. 
Events can even trigger a self-reconfiguration of the device as explained below. 

Each data or event pacl<et is only fonwarded If the previous one has already been 
5 consumed. The communication system was designed to transmit one packet on 
each interconnect per cycle. Hardware protocols ensure that no packets are lost, 
even in the case of pipeline stells or during the configuration process. This simplifies 
application development considerably. No explicit scheduling of operations is 
required. 

10 

2.3 Configuration 

The XPP architecture is optimized for rapid and user-transparent configuration. For 
this purpose, the configuration managers in the CM tree operate independently 

1 5 (without global synchronization), and therefore are able to configure their respective 
parts of the anray in parallel. Every PAE stores locally its current configuration state, 
i.e.. if it is part of a configuration or not (states "configured" or "fi-ee"). Once a PAE is 
configured, it changes its state to "configured". This prevents the respective CM from 
reconfiguring a PAE which is still in use. The CM caches the configuration data in its 

20 internal RAM and constantly tries to configure the objects used by the next configu- 
ration requested. Each XPP object has a configuration FIFO which stores data 
of subsequent configurations. Once an object has been released (state "free"), 
the next configuration word in its FIFO is loaded immediately. Hence it is possible to 
reconfigure partially in one clock cycle. Additionally, a prefetching mechanism is 

25 used. While a configuration is being loaded onto the FIFO of each object, other con- 
figurations may already be requested and cached in the low-level CMs' intemal 
RAM. Thus, it does not need to be requested ail the way from the SCM down to the 
an-ay when objects become available. While loading a configuration, its PAEs 
start their part of the computations as soon as Ihey are in state "configured". 

30 

Each ALU object has an input event port that triggers the self-releasing of its re- 
sources and of all of the objects connected to it. Such event is successively broad- 
casted according to the Interconnections. 
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Because of its course-grain nature, an XPP device can be configured rapidly. Since 
only tlie configuration of those array objects actually used is necessary, the configu- 
ration time depends on the application. 

5 

2.4 Development Tools 

The XPP can be programmed by using the Native Mapping Language (NML) [2], a 
PACT proprietary structural language with reconfiguration primitives. It gives the 

10 programmer direct access to all hardware features. In NML, configurations consist of 
modules which are specified as in a structural hardware description language, simi- 
lar to, for instance, structural VHDL PAE objects are explicitly allocated, optionally 
placed, and their connections specified. Additionally, NML includes statements to 
support configuration handling. Thus, configuration handling is an explicit part of 

1 5 the NIML application program. XDS is an integrated environment for programming 
with NML The main component is the mapper xmap which compiles NML source 
files, places and routes the objects, and generates XPP binary files, xmap uses an 
enhanced force-based placer with short runtimes. The XPP binaries can eitiier be 
simulated and visualized cycle by cycle with the xsim and xvis tools, or directly 

20 executed on an XPP device. A high-level compiler, described in the next section, 
has been added to XDS and permits to map C programs onto the XPP. 



2.5 Application Execution on XPP 



25 Reconfiguration and prefetching requests can be issued by any CM in the tree. (In- 
cluding the SCM which can respond to external requests) and also by event signals 
generated in the array itself. Running modules can do a self-releasing of their 
resources and request another configuration. Thus, it is possible to execute an 
application consisting of several configurations without any extemal control. 
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The CM of the XPP permits to exploit speculative configuration^, i.e., the con- 
figuration of a module possibly used after the current one has finished execution. 
If the path which includes that module is taken, the CM only has to trigger the 
5 execution of the configuration (See the section of the NML code in Fig.2 and the 
simulation perfomied with xsim in Fig.3, where conf_MOD2 is speculatively con- 
figured during the execution of conf_MOD0)- If this path is not taken, the CM trig- 
gers the releasing of the resources already configured and requests the other 
configuration. 

10 

3. Compiling C Code with XPP-VC 

The XPP Vectorizing C Compiler XPP-VC is based on the SUIF compiler frame- 
work [4]. SUIF is used because of its easily extensible properties. The XPP-VC 
1 5 compilation flow is shown in Fig.4. An options file, used by the compiler, specifies 
the parameters of the targeted XPP and the extemal memories connected to the 
XPP. To access XPP I/O ports specific C-functions are provided. 



20 CONFIG conf_MOD0 { 

CONF_MODULE(MOD0) // request the configuration of MODO 
REQUEST(conf_MOD2_spec)// start speculative configuration 
// if (MODO.CMPortO == "0") then conf_MOD2_exec is requested 
CONF_CMPORT(MODO.CMPortO, conf_MOD2_exec,J 

25 // if(MOD0.CMPort1 == "0") then conf_MOD1 is requested 
CONF_CMPORT(MOD0.CMPort1 , conf_MOD1, J 

} 

CONFIG conf_MOD2_spec { // request the configuration of MOD2 
CONF_MODULE(MOD2) // but do not start it 

30 } 

CONFIG conf_MOD2_exec { // MOD2 is taken 
SET(MOD2.Start.A = 1) // enable the start of computing of MOD2 
REQUEST(conf_MOD3) // request the next configuration 

} 

35 CONFIG conf_MOD1 { // MOD1 is taken 



^ This has similarities to speculative execution. In this case, before knowing if a configuration will 
be requested, its configuration is started. 
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REQUEST(conf_MOD2_rec) // releasing of resources of MOD2 
CONF_MODULE(MOD1) // request the MOD1 
REQUEST(conf_MOD3) // request the next configuration 

} 

5 CONFIG conf_MOD2_rec { 

RECONF(MOD1 .Start) // release the resources of MOD1 

} 

1 0 Figure 2: Section of NML describing the control flow. 

The compiler starts with some architecture-independent preprocessing passes 
based on well-known compilation techniques [5]. During this step, FOR loops are 
automatically unrolled if instructed by the programmer. Then the compiler per- 
15 forms a data-dependence analysis. The compiler tries to vectorize inner program 
FOR-loops. In XPP-VC, vectorization means that loop iterations are overiapped 
and executed In a pipelined, parallel ^shion. This technique is based on the 
Pipeline Vectorization method developed for reconfigurable architectures [6]. 

20 The C program can be manually splitted in several modules by using annotations. 
Otherwise, automatic temporal partitioning can be applied (see section 4) in order 
to fumish mappable modules and to reduce the overall latency. 

MODGen generates one NML module for each temporal partition. First, program 
25 data is allocated on the XPP. By default, MODGen maps each program an^y to 
internal or extemal RAM while scalar variables are stored in registers within the 
PAEs. Next, a control/dataflow graph (CDFG) is generated. Straight-line . code 
without array accesses can be directly mapped to a data-flow graph since the 
data dependencies are obvious in the DAG representation. One ALU is allocated 
30 for each operator in the CDFG. Because of the self-synchronization of operators 
on the XPP, no explicit control or scheduling is needed. The same is true for con- 
ditional execution of such blocks. Both branches are executed in parallel and 
MUX operators select the correct output (and discard the other one) depending 
on the condition. This data-driven execution of the operators automatically yields 
35 instruction-level parallelism. In contrast, accesses to the 
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Figure 3: Speculative configuration (enables earlier activation of MOD2). 



same array have to be controlled explicitly to maintain the conrect execution order. 

20 MERGE operators (which select one input without discarding the other one) route 
address and write data packets in the correct order to the RAM, and DEMUX op- 
erators route read data packets to the con-ect subsequent operator. State ma- 
chines for generating the correct sequence of event signals (to control these op- 
erators) are synthesized by the compiler. For conditional branches, containing 

25 anray accesses or inner loops, DEMUX operators controlled by the IF condition 
route data packets only to the selected branch, and output values are taken from 
the branch activated. Thus, only selected branches receive data packets and 
execute. 



30 In loops, all variables updated in the loop body are handled as follows. The first 
iteration uses an Input packet for the variable's value, and the subsequent itera- 
tions use packets generated in tiie previous iteration. In all but tiie last iteration, a 
DEMUX operator routes the outputs of the loop body back to the body inputs. 
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Only the results of the last Iteration are routed to the loop output by the DEMUX 
Operators. The control packets for the DEMUX are generated by the loop counter 
or the comparator evaluating the exit condition. Note that the internal operators' 
outputs cannot just be connected to subsequent operators since they produce a 
6 result in each loop iteration. The required last packet would be hidden by a 
stream of intennediate packets. If array accesses are present, a loop iteration 
may only be started after the previous iteration has terminated because the origi- 
nal access order must be maintained. This is enforced by event signals. 

10- For generating more efficient XPP configurations, MODGen generates pipelined 
operator networks for Inner pro- 
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gram loops which have been annotated for vectorization by the preprocessing 
step. In other words, subsequent loop iterations are started before previous itera- 
tions have finished. Data packets flow continuously through the operator pipe- 
lines. By applying pipeline balancing techniques, maximum throughput is 

5 achieved. For many programs, additional performance gains are achieved by the 
complete loop unrolling transformation. Although unrolled loops require usually 
more XPP resources, they yield more parallelism and better exploitation of the 
XPP. To reduce the number of array accesses, the compiler automatically re- 
moves redundant array reads. When an^ay references inside loops access subse- 

1 0 quent element positions the compiler only uses one reference and generates de- 
layed structures, forming shift-registers. 

Finally, each module generated by MODGen is placed and routed automatically by 

xmap. 

15 

The XPP-VC compiler currently supports a C-subset sufficient for programming 
real applications, struct data types, pointer operations, irregular control flow 
(break, continue, goto, label), and recursive and operating system calls 
are not supported or cannot be mapped to the XPP. 

20 

4. Temporal Partitioning 

A program too large to fit in an XPP can be handled by splitting it in several parts 
(configurations) such that each one is mappable. Temporal partitioning pemnits 

25 the automatic exposing of configurations such that the overall execution time, of 
the application is minimized and is successfully mapped onto the XPP resources. 
It considers the costs to load into the cache, to configure and to execute each 
configuration with the XPP. An important strategy that is considered is to pre-fetch 
configurations while another is being configured or is running. Arrays of constants 

30 or with pre-defined values used in one or more configurations can be initialized in 
parallel with the execution of the previous configurations. This takes advantage of 
the initialization of the array earned out by using the configuration bus. 
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The set of partitions resulting from the splitting are then processed by MODGen, 
generating a set of configurations. Next, specific NML configuration commands 
are generated which also exploit XPP's sophisticated configuration and pre- 
fetching capabilities, and specify the configuration control flow that is orchestrated 
5 by the CM. 

4.1 Benefits of Temporal Partitioning 

Temporal partitioning targeting the XPP can reduce, when efficiently applied, the 
10 overall execution time. Such reduction can be mainly achieved by the following 
issues: (1) reduction of each partition complexity can reduce the interconnection 
delays (long interconnections may pass through registers and thus add clock cy- 
cle delays); (2) reduction of the number of references, in the section of the pro- 
gram related to each partition, using the same resource, by distributing the overall 
15 references among partitions, can lead to performance gains as well. This hap- 
pens with the statements presented In the program referring the same array; (3) 
reduction of the overall configuration overhead by overlapping fetching, configu- 
ration and execution of distinct partitions. 

20 Example: Consider the C example max avg shown in Fig.5. 

Configuration boundaries are represented by XPP_next_confO statements. They 
define four configurations in the code (see Fig.6). Apart from exposing temporal 
partitions in such a way tiiat the mapping to XPP Is accomplished, combining only 
the most firequently taken conditional paths in tiie same partition can reduce the 

25 total execution time by substantially reducing tiie reconfiguration time (since the 
partitions for tiie other paths are not configured when they are not taken). Fig.6 
presents such a case. If the patii bb_0 and bbj has been identified as the most 
frequentiy executed, this path can be in tiie same partition^. In such a case, tiie 
configuration related to bb_2 will only be called when the most frequent path has 

30 not been taken. 



® Tail duplication of bb_3 would permit to have a configuration with {bb_0, bb_1, bb_3} and an- 
other one with {bb_2. bb_3}; 
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// max_avg example 

if {op==l) { // average kernel 
svm =0; 
5 for(i=0;i<N; i++) { 

suin+=x [i] ; 

} 

average = smn/N; 
XPP_next_conf ( ) ; 
10 } else { // max kernel 

XPP_next_conf { ) ; * 
max = 0; 

for{i=0;i<N; i++) { 

if(x[i] > max) max = x[i]; 

15 } 

XPP_next_conf { ) ; 

} 



20 Figure 5: 

Example with two conditionally executed kernels and with configuration bounda- 
ries represented. 




Figure 6: CFG of the algorithm shown in Fig. 5, Lines crossing edges represent 
the XPPjiextjconfO statements in the code. Bubbles containing basic blocks 
represent the regions to be implemented in different partitions. 

Since configuration takes many clock cycles, it is in most cases preferable to re- 
use a configuration as much long as possible in order to reduce the reconfigura- 

30 
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tion time overhead. Thus, loops in the source code are always good candidates to 
be entirely implemented by a single configuration. 

4.2 Partitioning Loops 

5 

Each loop that does not fit onto the XPP can be dealt with by performing loop dis- 
tribution [5] (if applicable) or by partitioning the loop and use the CM to orches- 
trate the control flow. Currently, loop distribution is not automatically applied, in- 
stead, we propose a new method to partition complex loops without restrictions. 

10 All the loops which their bodies must be partitioned are transfomned into straight 
line code with a jump to loop-exit or to the next iteration in order that each parti- 
tion can be compiled by MODGen. Fig.7 shows an example of such transformation 
without the statements needed to communicate the value of scalar variables be- 
tween configurations. Each configuration requests the next configuration to be 

1 5 taken (if none is, requested then the application terminates and the last configura- 
tion releases its resources). Depending on the value of the i <n condition, con- 
fig. #2 takes two different exits, which requests #3 or #4 respectively. Since con- 
fig. #3 always requests #2, at the end of its execution, the initial behavior of the 
loop is preserved. The temporal partitioning creates two additional configuration 

20 boundaries to preserve the initial functionality. From Fig. 7b can be seen that 

configuration boundaries were inserted before and after the if statement. These 
boundaries are needed since the code before and after will be executed once and 
both the if header and body will iterate N+l and N times respectively. 



25 int i; int i; 

i=0; #1 

for (i=0;i<N;i++) { labl: if(i<N) { #2 

stmtl; stmtl; #2 

XPP_next_conf ( ) ; stint2 ; #3 

30 stmt 2; #3 

} goto labl; #3 

stmts; } #3 



stmts; #4 
a) b) c) 



35 
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Figure 7: Example of the transformation applied for partitioning loops, a) original 
code added with the statement representing where the loop is partitioned: b) 
transformed code; c) configuration ID for each statement in b). 

5 4.3 Automatic Partitioning 

From the SUIF representation of the C source code the temporal partitioning 
phase constructs an Hierarchical Task Graph^ extended. HTG+. This extended 
graph has two types of nodes: (1) behavioral nodes representing lines of code in 

10 the input program; (2) array nodes representing each array existent in the source 
- code. For instance, Fig.8 shows the top level of the HTG+ for an implementation 
of the DCI (Discrete Cosine Transform) based on matrix multiplications. Type (1) 
nodes have three distinct sub-types: (a) block nodes representing basic blocks; 
(b) compound nodes representing if-then-else structures; (c) loop nodes rep- 

15 resenting the loops (for, while). Loop and compound nodes explicitly embody hi- 
erarchical levels. Edges in the HTG+ represent data communication between two 
nodes or just enforce execution's precedence. 

Each behavioral node of the HTG+ is labeled with the following information (some 
20 of the labeling steps require estimation efforts): (1 ) block and compound nodes: 
number of ALUs and REGs; (2) loop nodes: number of iterations (if unbound, pro- 
filing can be used), and number of ALUs and REGs; (3) anay nodes: the size of 
the an^y, type of the elements, and, when they do exist, the initialization values. 
Each edge between two behavioral nodes of the HTG+ is labeled with the number 
25 of data words that must be transfenred between the two nodes; Each edge be- 
tween an array and a behavioral node in the HTG+ is labeled with the number of 
load and store references in the source code represented by the behavioral node 
to that particular array. The estimated number of times that each load and store 
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reference will be executed is also collected. The use of the same array by differ- 
ent behavioral nodes, increases the execution latency and the number of re- 
sources needed for this partition °. 

5 TempPart uses three types of estimations: (1) number of XPP resource units 
needed by the configuration implementing a single or a set of behavior nodes; (2) 
latency for a behavior node or a set of connected behavior nodes on the HTG+ 
(this does not need to be accurate to the real execution time and only needs to 
have relative accuracy); (3) number of clock cycles to fetch and configure each 

10 partition (calculated based on the number of configuration words needed, which is 
computed with the estimation of the resources needed directly from the SUIF rep- 
resentation or with the number of edges, ALUs, REGs, and predefined values 
existent in the NML graph generated by MODGen). 

15 



20 



25 




Figure 8: Top level of the HTG+ for the DCT example (this top level consists of 4 



' The model has been chosen, isecause it also exposes loop and task level paralelism. 
' E.g.. twice the number of references to the same RAM leads to more than twice the number of 
objects required on XPP and delays each access because of the objects needed to MERGE and 
DEMUX data and address packets. Hence, combining several behavioral nodes In one partition 
incurs an overhead which, is computed during the temporal partitioning algorithm. 
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loops). Circles and boxes represent behavioral and array nodes respectively. 
Data Is read from an input port (Loop1) and written to an output port (Loop4). 



5 The temporal partitioning algorithm starts with a partition for each node on the top 
of the HTG+ and then merges iteratively adjacent partitions until no performance 
gains are achieved considering the maximum available size for each partition. 
Each partition must currently define, on the control flow graph (CFG) of the pro- 
gram, regions of code with ail entries to the same instruction and possibly multiple 
10 exists. The algorithm considers the overlapping of configuration and execution 
with fetch during the merging of partitions. The algorithm starts with the granular- 
ity of the nodes in the HTG+ and only if a block node cannot be mapped it con- 
siders partitioning at the statement or sub-block level. Thus, the granularity of the 
algorithm adapts according to the application needs. 

15 

The temporal partitioning strategy only exploits configuration boundaries inside 
loop bodies If an entire loop cannot be mapped to the XPP or contains more than 
one inner loop In the same level of the loop body. If these cases occur, the algo- 
rithm is applied hierarchically to the body of the loop. 
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Figure 9: Automatic temporal partitioning methodology. 

Fig.9 shows the metliodology which uses three levels (the computational efforts 
increase from the first to the third level): (1) Temporal Partitioning algorithm based 
on the estimation of the needed resources done with function costs based on the 
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number and kind of operations in the source code. Tiie algorithm uses the HTG+ 
and the SUIF representation of the program; (2)For each configuration, selected 
in the first level, the estimated sizes are checked with the ones estimated by gen- 
erating the NML graph with MODGen. If the size surpasses the available re- 

5 sources, the algorithm rerun level (1), relaxing the size constraint (dimlnuishing 
the maximum number of available resources); (3) check if each configuration suc- 
cessfully checked in level (2) can be really mapped to the XPP. This level uses 
functions of the mapper, placer and router. If the configuration cannot be imple- 
mented in the XPP, the algorithm returns to level (1), once more relaxing the size 

1 0 constraint. The size constraint is relaxed by reducing the alpha parameter in each 
backward iteration (see Fig. 9) 

After exposing the configurations with TempPart, the compiler introduces the 
statements needed to communicate scalar variables between partitions (see Fig. 

15 10). Arrays are used as inter-partition storage for scalar variables too, since only 
RAMs (to which the an^ys are mapped) keep their data during reconfiguration. 
TempPart also ensures that arrays used by more than one configuration, or by 
the same configuration loaded more then once onto the XPP, are bound to the 
same memory location and such location is not used by other an^ays during the 

20 lifetime of the an-ay variable. The assignment of all arrays (the initially used in the 
source code plus the added ones to communicate data) to the internal memories 
is done based on the lifetimes of the arrays determined by the sequence of con- 
figurations that were previously exposed in the input program. This pemiits, in 
some cases, to use less internal memories since they can be time shared, among 

25 different configurations. 



int 



comm [ 1 ] ; 



30 



XPP 
d = 



a = 



a) 



b * c; 

next_conf ( ) ; 
a/e; 



a = b * c; 
coinni[0] = a; 
a = coinm[0] ; 
d = a/e; 
b) 



#1 

#1 
#1 
#2 
#2 
c) 
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Figure 10: Example Illustrating the communication of the value of a scalar variable 
between two configurations, a) source code; b) code with statements inserted to 
5 buffer the data; c) configuration ID for each of the statements in b). 

4.4 Generating the NML Application 

Each partition is input to MODGen. which generates the NML structure to be 
10 mapped to the XPP. MODGen generates, for each exit point existent in each parti- 
tion, an event connected to one of the CM ports available in the XPP (the CM can 
check if an event is generated and can proceed with different configurations 
based on the value of the event). The compiler generates both the NML repre- 
sentation of each partition and the NML section specifying the control flow of con- 
15 figurations. Such control flow is orchestrated by the CM of the XPP during run- 
time, as has been already explained. 

The compiler also generates NML code considering the pre-fetch (load of a con- 
figuration to the cache of the XPP) of configurations. The compiler can furnish two 

20 different strategies: (1 ) request of the pre-fetch of all configurations existent in the 
application in the start of the execution; (2) request in each configuration of the 
pre-fetch of the next. The request is done before the start of the configuration step 
for the current configuration. Strategy (1 ) is used most of the times. However, 
there are cases where using (2) is better In the presence of several nested if- 

25 then-else Structures with different configurations for each branch, a pre-fetch 
sequence defined at compile time can introduce too much overhead. 

5. Experimental Results 

30 Tab. I shows some results obtained when compiling a set of benchmarks with the 
XPP-VC. Note that none of the examples shown was specially coded to ex- 
ploit more efficiently the architectural features of the XPP (e.g., partitioning 
and distribution of arrays among the internal memories) and thus the results can 
be further improved. An XPP Core with a single PAC was used. The 2nd column 
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represents the size of the PAC (number of colunnns and rows of PAEs) used for 
each exannple. Colunnns #cf, #PAE, #Lat, and #max represent the number of con- 
figurations, number of PAEs used (it is shown the maximum number of PAEs of 
the largest configuration and the total number of PAEs virtually needed), overall 
5 latency (taken Into account setup, fetching, configuration, data communication 
and execution), and the maximum number of objects executing per cycle respec- 
tively. The last column shows the CPU time (using a Pentium III @933MHz with 
Linux) to compile each example (from the source program to the generation of the 
binary configuration file). 

10 

DCT1 Is a 8x8 discrete cosine transform implementation which is based on two 
matrix multiplications. The algorithm uses 6 loops for the multiplications and 2 
loops to stream I/O data. It is purely sequential (no unrolling is used). Temporal 
partitioning improves the overall latency of DCT1 by 13% and uses 31 PAEs 

1 5 (without partitioning 51 PAEs are used). Thus it can use a smalter XPP core. 
DCT2 uses the DCT kemel of DCT1 and traverses an input Image of a pre- 
defined size (16x16 Is used). It uses 2 external memories to load/store the image 
and 2 internal RAMs for intermediate results and to store the coefficients. The 
version with 6 configurations was obtained performing temporal partitioning. Since 

20 the example has two outer loops the scheme to partitioning loops was applied 
(the compiler uses one configuration boundary between the two main loops of 
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Example 


XPP 


#cf 


Jin A rr 


#Lat 

(CCS) 


#niax 


time (s) 


DCT1 


8x12 


1 


51/51 


12.401 


7 


1.0 


DCT1 


6x12 


4 


31/79 


10,957 


5 


0.9 


DCT2 . 


6x12 


1 


59/59 


42,034 


8 


1.0 


DCT2 


6x12 


6 


41/104 


40.516 


8 


1.2 


Chen 


16x16 


1 


64/64 


16,852 


22 


4.8 


Chen 


8x16 


5 


56/154 


10.181 


24 


3.3 


Smooth 


6x12 


1 


44/44 


7,045 


8 


0.8 


Haar 


6x12 


1 


58/58 


10,480 


10 


1.1 


Haar 


6x12 


4 


39/112 


7,683 


10 


0.9 


FIR 


6x12 


1 


49/49 


4,864 


28 


0.9 



5 

the DOT kernel). With this scheme, a gain of 4% in performance was achieved 
using 30% less PAEs. Chen is a pointer-free version, with 180 lines of C, of a 
DCT implementation used in JPEG. Temporal partitioning furnished an improved 
version: 66% in perfomiance using 12% less PAEs. The computation and data- 

10 communication is perfomned in 688 clock cycles, smooth represents an image 
filter (16x 16 image). The two inner loops (3x3 window) were annotated to be un- 
rolled and conducted to an efficient vectorization. An overall speedup of 4 (8.6 
considering only execution time) over the implementation obtained without un- 
rolling is obtained. Additionally. 2 less PAEs are used with unrolling. Haar is an 

1 5 implementation of the fonward 2D Haar wavelet transfomi. An input image of 

16x16 is used. A perfomiance gain of 36% is achieved when temporal partitioning 
is applied. FIR is a 1D FIR filter with 12 taps filtering 2048 samples. Even with all 
the overheads, 0.42 samples/cycle is computed (0.87, considering only the la- 
tency to communicate data to external memories and the FIR computation). 
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Each one of the examples was compiled in less than 5 seconds. This reveals that 
it is possible to have runtimes comparable to the ones achieved by software com- 
pilation. Performance gains obtained with temporal partitioning are shown. Since 
5 we ran most examples with small data and image sequences, the configuration 
overhead is significant 

Note that the current methodology does not use neither the full potentialities of 
the XPP nor some optimizations: (1)The execution of a partition only starts after 

10 the full configuration of its resources; (2) No pipelining between fetch and configu- 
ration for the same partition has been used; (3) The capacity of the XPP to con- 
figure concun-ently distinct PACs was not used; (4) An arbitrary order for pre- 
fetching of configurations conditionally requested is used (the order should be 
based on the most frequently taken path, e.g., detennined by profiling); (5) The 

15 configuration FIFOs in each array-object were not used. Hence, the performance 
results can he further improved. 

6. Related Work 

20 The XPP technology offers a promising reconfigurable computing platfomn. Being 
a step fonward in the context of reconfigurable computing, it permits to attack 
some of the well-known deficiencies of related technologies. The following sub- 
sections illustrate the most closely related work and reveals the most important 
differences. 

25 

6.1 High-Level Compilation 

The work on compiling high-level descriptions onto reconfigurable logic has been 
the focus of many researchers since the first simple attempts [71- Most of this 
30 work targets FPGA devices and thus need logic synthesis, even when module 
generators are included in the compilation flow, as is the case with the MARGE 
[8] compiler. In addition, such approaches also need backend mapping, place and 
route, which are very time consuming with FPGA technology. Even when pre- 

40 

SUBSTITUTE SHEET (RULE 26) 



wo 03/071418 



PCT/EP03/00624 



placed and pre-routed components are used to assist the compilation flow, the 
compilation time is still in the order of minutes or hours. 

New approaches have been used, which target research archrtectures. One of 
5 those approaches is the Garp-C compiler [9]. Although it is used for a reconfig- 
ware/software architecture, the configuration bit stream generation, based on ex- 
ploitation of instruction-level parallelism beyond basic blocks and assisted with 
fast mapping and placement tasks pemnits to target fine^rain reconfigurable ar- 
chitectures efficiently with short compilation times. 

10 

As Garp-C and MARGE, XPP-VC also uses the SUIF compiler front-end. The 
generation of the hardware structure to be mapped to the XPP is assisted with the 
pipeline vectorization ideas presented in [6]. However, the generation of the con- 
trol structure, based on the event packets of the XPP is completely new. Since 
1 5 the XPP is a coarse-grained architectuiB, which directly supports arithmetic and 
other operations occuning in high-level languages, there is no need for complex 
synthesis and mapping. The control structure is also directly mapped to objects 
handling events. 

20 6.2 High-Level Temporal Partitioning 

Temporal partitioning at the behavioral level has been already successfully con- 
ducted for FPGAs and other type of RPUs. The majority of the current ap- 
proaches try to use a minimum number of configurations by using all the possible 

25 RPU size available for each temporal partition (see, for instance, [lOj). Such 

schemes only consider another partition after the cun'ent one has filled the avail- 
able resources and are insensible to the optimization that must be applied to re- 
duce the overall execution by overlapping the fetching, configuration and execu- 
tion steps. Albeit not considering such optimizations, ILP formulations presented 

30 by some authors [1 1] are uncapable to deal with the complexity of many realistic 
applications. 
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One of the first attempts to reduce the configuration overhead in the context of 
temporal partitioning has been presented in [12]. However, the approach uses the 
simple model of spitting the available FPGA resources into two parts and per- 
forming temporal partitioning using half of the total available area as the size con- 
5 straint. The scheme only overlaps configuration with execution of adjoining parti- 
tions and does not take into account the pre-fetch steps that can be efficiently 
used in some RPU architectures. Furthermore, the approach causes problems, 
when some resources of the RPU must be shared by two or more partitions. This 
contradicts the requirement of disjoint spaces of the RPU used by two adjacent 
10 temporal partitions. 

The temporal partitioning algorithm used in the XPP-VC compiler is based on 
some ideas presented in [13]. The special characteristics of the algorithm to deal 
with resource-sharing during the creation of the partitions are not used and spe- 

15 cial heuristics have been added to deal with the fetch and configuration time of 
each partition. The purposed partitioning of loops was firstly introduced in this pa- 
per. The scheme can deal with any type of loops. The previous approaches con- 
sider loop distribution when a loop does not fit onto the RPU [14]. However, loops 
which cannot be entirely mapped onto a single configuration and which cannot be 

20 distributed are not compiled. Our method can deal with programs with unlimited 
complexity as long as the supported C subset is used. It does not depend on the 
feasibility of a specific compiler transformation. 

7. Conclusions and Future Work 

25 

This paper describes the new Vectorizing C Compiler, XPP-VC, which maps pro- 
grams in a C-subset extended by port access functions to PACTs XPP architec- 
ture. Assisted with a fast place and route tool, it furnishes a complete "push- 
button" path firom algorithmic descriptions onto XPP configuration data with short 
30 compilation times. 

An innovative temporal partitioning scheme is presented. It enables the mapping 
of complex programs and furnishes XPP applications with performance gains by 
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hiding some of the configuration time. A new mechanism to handle partitioning of 
loops, which supports loop execution by the configuration manager of the XPP, is 
also presented. Furthennore, the compiler generates self-contained configuration 
data even when several configurations are exposed. 

5 

Ongoing work focuses on tuning the estimation steps to assist automatic temporal 
partitioning and on improving the configuration data generated. 

In addition to loop unrolling, loop merging, loop distribution and loop tiling will be 
10 used to improve loop handling, i.e., enable more parallelism or better XPP usage. 
A future extension of the compiler for a host-XPP hybrid system is planned. The 
compiler will map suitable program parts, especially inner loops, to the XPP, and 
the rest of the program to the host processor. 

15 8. Acknowledgements 

We gratefully acknowledge the support of the PACT team, especially the support 
and assistance of Amiin StrobI, Daniel Bretz, and Frank May related to the XDS 
tools. 



43 

SUBSTITUTE SHEET (RULE 26) 



wo 03/071418 



PCT/EP03/00624 



References 

[I] R. Hartenstein, "A Decade of Reconfigurable Computing: a Visionary Retro- 
spective," in Proc. Design, Automation and Test in Europe (DATE'01), 2001, 

5 pp. 642-649. 

[2] PACT Informationstechnologie GmbH, Gemiany. "The XPP White Paper," 

Release 2.0, June 2001. httpVAAmnv. pactcorp.com 
[3] V. Baumgarte, F. May. A. NQckel, M. Vorbach, M. Weinhardt. "PACT XPP - 
A Self-reconfigurable Data processing Architecure", In Proc. Infl Conf. on 
1 0 Engineering of Reconfigurable Systems and Algorittims (ERSA'01), Las Ve- 

gas, Nevada, USA, June 25-28, 2001 , pp. 64-70. 
[4] SUIF Compiler system, "The Stanford SUIF Compiler Group," 

http://suif.stanford.edu 
[5] Muchnick, S. S. Advanced Compiler Design and Implementation, Morgan 
15 Kaufmann Publishers, Inc., San Francisco, CA, USA. 1997. 

[6) M. Weinhardt and W. Luk, "Pipeline Vectorization," In IEEE Transactions on 
Computer-Aided Design of Integrated Circuits and Systems, Feb. 2001 , pp. 
234-248. 

[7] Ian Page and Wayne Luk, "Compiling occam into FPGAs," In FPGAs, Will 
20 Moore and Wayne Luk, editors, Abingdon EE&CS Books, Abingdon, Eng- 

land, UK, 1991, pp. 271-283. 
[8] Maya Gokhale, and Edson Gomersall, High-Level Compilation for Fine 
Grained FPGAs, In Proc. IEEE 5th Symposium on Field-Programmable 
Custom Computing Mactiines (FCCM'97), Napa Valley, CA, USA, April 16- 
25 18, 1997. pp. 163-173. 

[9] T. J. Callahan, J. R. Hauser, and J. Wawrzynek 'The Garp architecture and 

C compiler," In IEEE Computer, 33(4), April 2000, pp. 62-69. 
[10] Joao M. P. Cardoso, and Horacio C. Neto. "An Enhanced Static-List Sched- 
uling Algorithm for Temporal Partitioning onto RPUs," In Proc. IFIPXInt'l 
30 Conf. on Very Large Scale Integration (VLSI'QQ), Lisbon, December 1- 

3,1999, Kluwer Academic Publishers, pp. 485-496. 

[II] I. Ouaiss, et al., "An Integrated Partitioning and Synthesis System for Dy- 
namically Reconfigurable Multi-FPGA Architectures," in Proc. 5th Reconfigu- 

44 

SUBSTITUTE SHEET (RULE 26) 



wo 03/071418 PCT/EP03/00624 

rable Architectures Workshop (RAW98), Orlando, Florida, USA, March 30, 
1998, pp. 31-36. 

[12] Satish Ganesan, and Ranga Vemuri, "An Integrated Temporal Partitioning 
and Partial Reconfiguration Technique for Design Latency Improvement," in 
5 Proc. Design, Automation & Test in Europe (DATE'OO), Paris, France, March 

27-30, 2000. pp. 320-325. 
[13] Jodo M. P. Cardoso, "A Novel Algorithm Combining Temporal Partitioning 
and Sharing of Functional Units," In Proc. IEEE 9th Symposium on Field- 
Programmable Custom Computing Machines (FCCM'01), Rohnert Park, CA, 
1 0 USA, April 30-May 2, 2001 . 

[14] Meenakshi Kaul, Ranga Vemuri, Sriram Govindarajan, lyad E. Ouaiss, "An 
Automated Temporal Partitioning and Loop Fission approach for FPGA 
based reconfigurable synthesis of DSP applications," in Proc. IEEE/ACM 
Design Automation Conference (DAC'99), New Orleans, LA, USA, June 21- 
15 25. 1999, pp. 616-622. 



45 

SUBSTITUTE SHEET (RULE 26) 



WO_ 03/071418. . PCT/EP03/00624 

Temporal Partitioning for the XPP-VC Compiler 
1 . Benefits of Temporal Partitioning 

5 Temporal partitioning can have a distinct and important goal than to simple en- 
able the compilation of algorithms which the mapping onto the RPU (Reconfigu- 
rable Processing Unit) resources cannot be accomplished by only one configura- 
tion. For instance, temporal partitioning targeting the XPP [1] can reduce, when 
efficiently applied, the overall execution latency. Such reduction can be mainly 
10 enabled by the following issues: 

• reduction of the interconnection lengths, by reducing each design com- 
plexity, can furnish better performance results (long interconnections pass 
through registers and thus adding clock cycle delays); 

15 

• reduction of each temporal partition complexity can itself reduce the num- 
ber of registers used for vertical routing; 

• reduction of the number of references, in each temporal partition, using the 
20 same resource, by distributing the overall references among temporal par- 
titions, can furnish better performance results as well. This happens with 
the statements presented in the program referring the same array; 

• reduction of the overall configuration overhead by overlapping fetching, 
25 configuration and execution of distinct temporal partitions. 

The reduction of the configuration overhead is due to 3 distinct sources of over- 
lapping, possible with the XPP architecture: 

30 1 . loading of subsequent configurations into the cache in parallel with the 
configuration of the current one; 



2. 



execution of one configuration while the next one is being configured; 
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3. execution of one configuration wiiile the next one is being loaded into the 
cache. 



m. 



15 



20 




26 Figure 2: CFG of the algorithm shown in Fig. 1 . The lines crossing edges repre- 
sent the XPPjnext confQ statements in the code. The bubbles containing basic 
blocks of the CFG represent the exposed regions of the CFG that are imple- 
mented in difiierent temporal partitions. 
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if (QP) {• 

if (Mode == MODE_INTRA I 1 Mode == MODE_INTRA_Q) { Intra */ 
5 qcoeff[0] = mmax (l,inmin(254,coeff [0] /8) ) ; 

for (i = 1; i < M; i++) { 

level = (abs (coeff [i] ) ) / {2*QP) ; 

qcoeff[i] = inmin.(127,inmax (-127, sign (coeff [i] ) * level)); 

}. 

1 0 XPP_next_conf ( ) ; 

} else { /* non Intra */ 
XPP_next_conf ( ) ; 
for (i = 0; i < M; i++) { 

level = abs (coeff [i] )-QP/2) / {2*QP) ; 
15 qcoeff[i] = mmin(127,inmax(-127, sign (coeff [i] ) * level)); 

} 

XPP_next_cdnf ( ) ; 

} 

} else { 
20 XPP_next_conf ( ) ; 

/* No quantizing.*/ 

for (i =0; i < M; i++) { 
qcoeff [i] = coeff [i] ; 

} 

25 XPP_next_conf ( ) ; 
} 



Figure 1: C source code of the quantization algorithm with configuration bounda- 
30 ries specified. 

(2) can have more impact if the map, place, and route phase could, as most as 
possible, to confine temporally adjacent configurations in distinct locations of the 
XPP (augmenting the concurrency between execution and configuration). ' 

35 

For reconfigurable computing platforms, where the reconfiguration of the array 
takes several clock cycles, is in most of the cases preferable that a configuration 
is reused as much time as possible in order to reduce the reconfiguration over- 
head. Thus, loops in the source code are always good candidates to be entirely 
40 implemented in a single configuration. 
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2. Specification of Configurations 

Configurations can be specified by the programmer using XPP_next_confO 
statements in the source code of a given application. Such statements must ex- 
5 pose, on the control flow graph (CFG) of the procedure, regions of code with all 
entries to the same instruction and eventually multiple exists. The compiler ex- 
poses the configurations, removes such statements from the SUIF1 [2] intemnedi- 
ate representation, checks for invalid specifications of configuration boundaries 
(when the statements expose regions with entries to different statements in a re- 

10 gion of code, or when code can be contained in more than one region^), inserts 
the code responsible to the data communication between temporal partitions, and 
generates both the NML (Native Mapping Language) [8] representation of each 
configuration and the application section specifying the control flow of configura- 
tions. Such control flow is orchestrated by the CM (Configuration Manager) of the 

15 XPP during runtime. 

Consider a pointer-free Version of the quantiser of an h263 implementation [3] 
which code Is shown in Fig. 1. Four XPP_next_confO statements were inserted In 
the code to specify three configurations. The configurations specified are repre- 

20 sented in the CFG of the example that can be seen in Fig. 2. Apart from specify- 
ing temporal partitions in such a way that the mapping to XPP is accomplished, 
there can be the case that, merging only the mostiy taken conditional paths in the 
same configuration can reduce the total execution time by substantially reducing 
the reconfiguration time (since the partitions for the other paths are not configured 

25 when they are not taken). Fig. 2 presents such a case. If the path bb_0, bb_1 and 
bb_2 was identified as the most frequently executed, such path can be specified 
to be in the same configuration In such a case, the configurations related to 
bb_3 and bb_4 will only be called when the most frequently path has not been 
taken. In some examples, paths are only executed in "degug mode" (as is the 

30 

^ Tail duplication could be applied In some examples. 

Tail duplication of bb_5 would permit to have a configuration with {bb_0, bb_1, bb_2. bb_5}; 
another one with {bb_4. bb_5}; and another one with bb_3. bb_5}; 
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case of the branch taken when QP evaluates to false in the source code of Fig. 
1). 

After exposing the configurations, the temporal partitioning phase introduces the 
5 statements needed to communicate scalar variables between two different con- 
figurations (see Fig. 3). Currently, the scalar variables are stored in 



10 





int comm [ 1 ] ; 


conf #1 


a = b * c; 


a = b * c; 


conf #1 


next_conf ( ) ; 


comm [ 0 ] = a ; 


conf #1 


d = a/e; 


a = comm[0] ; 


conf #2 


• • • 


d = a/e; 


conf #2 


a) 


b) 


c) 



15 

Figure 3: Example illustrating the communication of the value of a scalar variable 
between two configurations, a) source code; b) source code with statements in- 
serted to buffer the data; c) configuration ID for each of the statements In b). 

20 

arrays specially inserted in the SUIF1 representation of the given application. 
Those anays are mapped to internal memories of the XPP^V The temporal parti- 
tioning phase also ensure that arrays used by more than one configuration, or by 
the same configuration loaded more than once to the XPP, are binded to the 
25 same memory location and such location is not used by other arrays during the 
lifetime of the array variable^ ^. 



" in this case, the internal memories are used as data buffer for the maintainance of the original 
program behavior. 

12 

At the moment each configuration must use a number of array variables, to be assigned to the 
internal memories of the XPP, less or equal than the number of internal memories of the XPP 
(the compiler assigns each an^y to a distinct memory). However, the total number of arrays 
existant on the overall configurations can surpass the number of internal memories, if some 
memories can be shared between configurations due to the non-overiap of the lifetime of array 
variables. The data stored in memories is maintained across reconfigurations. 
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The assignment of the overall existent arrays (the initially used in the source code 
more the added ones to communicate data) to the internal memories is done based 
on the lifetimes of the arrays determined by the sequence of configurations that were 
previously exposed in the application. This pemiits, in some cases, to reduce the 
number of intemal memories needed by time sharing, among different configura- 
tions, some intemal memories during the execution of the application on the XPP. 

The XPP-VC compiler generates, for each exit point existent in each configuration, 
an event connected to one of the CM parts available in the XPP (the CM can check 
If an event is generated and can proceed with different configurations based on the 
value of the event). The generated event has value "0" if the path that activates that 
exit is taken and "1" otherwise. 

3. Mapping Applications with Configuration Boundaries in Loop Bodies 

Configuration boundaries in loop bodies can be deal by perfomiing loop distribution 
(as long as it can be applied) or by temporal partitioning the loop and 



int i; 

for (i=0;i<N;i++) { 
statement 1; 
XPP_next_conf ( ) ; 
statement2; 

} 

statements; 
a) 



int i; 
i=0; 

labl: if(i<N) { 
statementl; 
statement2; 
i++; 

goto labl; 

} 

statements; 
b) 



conf #1 
conf #2 
conf #2 
conf #3 
conf #3 
conf #3 
conf #3 
conf #4 
c) 



Figure 4: Example of the transformation applied to loops with configuration bounda- 
ries In their bodies, a) original source code; b) transformed code; c) configuration ID 
for each statement in b). 

use the CM to orchestrate the control flow. 
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Currently, loop distribution [1 1] is not automatically applied^^. All the loops with 
configuration boundaries specified in their bodies are transformed into ifQ goto 
label kind loops in order to permit the NML generation by the XPP-VC compiler. 
Fig. 4 shows an example of such transformation without the statements needed to 
5 communicate the value of scalar variables between configurations. The 3rd col- 
umn shows the configuration ID of each statement. Each configuration requests 
the next configuration to be taken (If the exit taken is to the end then only the "re- 
conf^* of the configuration is done). Conf #2 needs a conditional request mecha- 
nism to call conf #3 or conf #4 based on the value of the i<N expression. Since 

10 conf #3 always requests, at the end of its execution, conf #2, the initial behavior of 
the loop is maintained. The temporal partitioning task also creates two more con- 
figuration boundaries to preserve the initial functionality. From Fig. 4b) can be 
seen that configuration boundaries were inserted before and after the state- 
ment. Such boundaries are needed since the code before and after will be exe- 

15 cuted once and both the if header and body will iterate N+1 and N times respec- 
tively. 

The configuration boundaries inserted in loop bodies must specily, at the scope of 
the loop body, the pemiitted type of regions (already exjjlained). 

20 

Loop distribution (also known as "loop fission") will be the preferable form to im- 
plement loops, which generated NML does not entirely fit in the available re- 
sources of the XPP. Such transformation can potentially lead to the introduction of 
temporary arrays. Consider the loop shown in Fig. 5 where a configuration 
25 boundary is specified. The loop can be splitted so that the two statements are 
each one in one loop and the configuration boundary is now outside any loop 
body. However, we need to scalar expand variable s in order to maintain the 



" The compiler should check if the loop distribution can be applied on each temporal partition 
boundary existant in loop bodies. 

A reconf means thatthe resources used by that configuration are released and than can be 
reconfigured. 
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int tnip_s [N] ; 



for(i=0; i<N; i++) { 
s = ... // statement 1 
XPP_next_conf ( ) ; 

... = s // statement 2 



for(i=0; i<N; 
tmp_s (i) = . . 



i++) { 

.// statement 1 



XPP_next_conf 
for(i=0; i<N; 
. . . = tmp_s (i) 



0 ; 

i++) { 

// statement 2 



10 



15 



20 



25 



Figure 5: Applying loop distrlbLitlon as another way to enable temporal partitioning 
on loop bodies. 

initial functionality (one array witli the size of the number of iterations of the loop 
must be declared and is used to communicate each s value in each of the loop 
iterations). 

4. Execution Strategies 

To reduce the overall latency, efficient exploitation of the pipelining of the 3 steps 
presently in each temporal partition (fetching, configuring, and an^y execution) 
must be conducted. 

Two "reconf modes can be used (the user can select one of the modes in the 
options of the XPP-VC compiler related to temporal partitioning): 

• "reconf executed by the CM. In this case each configuration communicates 
with the CM sending an event, when the completion of execution, to request 
the next configuration. This next configuration starts by executing a "reconf 
command to an XPP resource of tiie configuration (that command is broad- 
casted throughout all the resources used by the configuration, and so the 
resources will be released and can be reconfigured by the next configura- 
tion). When a configuration can be requested by more than one previous 
configuration, special configurations are inserted in the Temporal Partition 
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Control Flow Graph (TPCFG''®) between each source and the sink. Such 
special configurations only command the "reconf of a resource in the XPP 
of the previous configuration and request the next one. This type of "reconf 
does not permit to have overlapping between execution and configuration 
5 between temporal partitions; 

• "reconf self applied by each configuration. In this case each configuration at 
the end of the execution broadcasts a "reconf event to all the XPP re- 
sources pertencing to it. This mode does not need addition of special con- 

1 0 figurations by the compiler and pemnits that the CM try to configure the next 

configuration called during the execution of the called temporal partition 
(when only one configuration path is presented). 

The compiler also generates NML code considering the pre-fetch (load of a con- 
15 figuration to the cache of the XPP) of configurations. When the pre-fetch is en- 
abled two strategies can also be automatically used: 

• request of the pre-fetch of all configurations existent in the application in the 
start of the execution (during this pre-fetching the flow of configuration and 

20 execution is done in the way it is specified in the application section of the 

NML file); 

• request in each configuration of the pre-fetch of the next. The request is 
done before the start of the configuration step for the current configuration. 

25 

The CM of tiie XPP pemnits also speculative configuration of a temporal partition 
that can conduct to better performance results even when the Map, Place and 
Route does not try to locate temporal partitions in non-overiapping areas of the 
XPP. The strategy tries to configure the partition speculatively used after the con- 



^® The TPCFG is a directed, eventually cyclic, graph where each node represent a configuration 
(temporal partition) and each edge between two nodes specifies the execution flow of the ap- 
plication through its temporal partitions. There is only one edge between two nodes of the 
graph and each node represents a region of the CFG of the application. 
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figuration of tlie current one. If the patli witch includes that configuration is taken, 
the CM only has to enable the start of the execution of the configuration (see the 
section of the NML code in Fig. 6 and the simulation results in Fig. 7, where 
conf_MOD2 Is speculatively configured during the execution of conf_MOD0). 
5 When such path is not taken, the CM releases the resources already configured 
and requests the other configuration. 

5. Automatic Temporal Partitioning 

10 Automatic temporal partitioning permits the automatic exposing of configurations 
oriented by two distinct goals: 

• minimum number of configurations: this goal can be achieved with algo- 
rithms that try to use all the available reconfigurable processing units during 

15 the assignment of segments of behavioral code to the same configuration; 

• minimum overall latency: this goal can be achieved by considering the costs 
to load into the cache, to configure and to execute each configuration with 
the XPP array. An important strategy that must be considered is the use of 

20 pre-fetch of configurations while one of the others is running. Arrays of con- 

stants or with pre-defined values used in one or more configurations can be 
initialized in one of the previous configurations if such one exists. This takes 
advantage of the initialization of the array carried out by using the configura- 
tion bus. 

25 

CONFIG conf_MODO { 

CONF_MODDLE(MOD0) // request the configuration of MODO 

SET(MOD0.Reconf .E = 1) // enable the self releasing of resources 

30 REQUEST (conf_M0D2_spec) // speculative configuration 

// if (MODO . CMPortO "0") then conf_MOD2_exec is requested 
// else continue 

//if (MODO . CMPortl == "0") then conf_MODl is requested 
COKF_CMPORT (MODO. CMPortO, conf_MOD2_exec, _) // take M0D2 or continue 
35 CONF_CMPORT (MODO. CMPortl, conf_M0Dl, _) // take MODI 

} 
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10 



15 



CONFIG conf_M0D2_spec { 
CONFJ^ODULE {M0D2 ) 

} 

CONFIG conf JM0D2_exec { 
SET (M0D2.Start ,A = 1) 
SET{MOD2.Reconf .E = 1) 
REQUEST (conf_M0D3) 

} 

CONFIG conf__MODl { 

REQUEST (conf_MOD2__rec) 
CONF_MODULE (MODI) 
SET (MODI. Reconf.E = 1) 
REQUEST (conf_M0D3) 

} 

CONFIG conf_M0D2__rec { 
RECONF (MODI -Start) 

} 



// request the configuration of M0D2 
// M0D2 is taken 

//enable the start of computing of M0D2 
// enable the self release of resources 
// request the next configuration 

// MODI is taken 

// request the releasing of resources 
// request the MODI 

// enable the self releasing of resources 
// request the next configuration 



// release the resources of MODI 



20 



25 



30 



35 



40 



Figure 6: Example of a section of NML code describing tlie speculative configura- 
tion concept. 
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Figure 7: Example of the overlapping among the fetch, configuration, and execu- 
tion steps of different temporal partitions. 
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From the SUIF1 representation of the C source code the temporal partitioning 
phase constructs an extended HTG (Hierarchical Task Graph)^^. Such extended 
5 graph has two types of nodes: 

1 . behavioral nodes representing source lines of code in the input program; 

2. an-ay nodes representing each array existent in the source code. 

10 

Type (1) nodes have three distinct sub-types: 

1 . block nodes representing basic blocks with one-entry and a single exit; 
15 2. compound nodes representing if-then-else structures; 



3. loop nodes representing the loops (for, while, ect). Loop and compound nodes 
explicitly embody hierarchical levels. 

20 Edges in the HTG+ represent data communication between two nodes or just 
enforce execution's precedence. 

Each behavioral node of the HTG+ is labeled with the following information (some 
of the labeling steps require estimation efforts): 



25 



30 



• block and compound nodes: number of ALUs and REGs; 



^® The model has been chosen, because it will also permit to exploit loop and task level paralel- 



ism. 
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5 



10- 



15 




Figure 8: Top level of the HTG+ for the DCT example (this top level consists of 4 
loops). Circles and boxes represent behavioral and array nodes respectively. 
20 Loop 1 reads the data-stream from an input port to an intemal memory and Loop 
4 writes the data-stream generated by the DCT code (Loops 2 and 3) from an 
intemal memory to an output part of the XPP. 



25 • loop nodes: number of iterations (unknown if unbound), and number of ALUs 
and. REGs; 

• array nodes: the size of the array, type of the elements, and, when they do ex- 
ist, the initialization values. 

30 

Each edge between two behavioral nodes of the HTG+ is labeled with the number 
of data words that must he transferred between the two nodes. 

Each edge between an array and a behavioral node in the HTG+ is labeled with 

35 the number of load and store references (... = A[i] and A[i] = ... respectively) In the 
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source code represented by the behavioral node to that particular array. The es- 
timated number of times that each load and store reference will be executed is 
also collected. Such information is used to calculate the penalty when two or 
more behavioral nodes are merged into the same temporal partition. Such penalty 
5 is related to the use of the same array by different behavioral nodes and adds an 
overhead to the execution latency of that temporal partition and to the number of 
resources needed for its implementation. 

Fig. 8 shows the top level of the HTG+ for an implementation of the DOT (Dis- 
1 0. . Crete Cosine Transfomi) based an matrix muttiplications. 

The automatic temporal partitioning phase needs 3 types of estimations: 
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Figure 9: Automatic Temporal Partitioning methodology. 



• number, of XPP resource units needed by the configuration implementing a 
single or a set of behavior nodes; 

5 

• latency for a behavior node or a set of connected behavior nodes on the 
HTG+ (tiiis does not need to be accurate to the real execution time and only 
needs to have relativeness accuracy); 

10 • number of clock cycles to fetch and configure each temporal partition (cal- 
culated based an the number of configuration words needed ^^). 

The temporal partitioning strategy does not exploit configuration boundaries in- 
side loop bodies, unless the entire loop cannot be mapped to the XPP. The gen- 
eration of this type of temporal partitions never produces better results (at least 
when tine loop behavior is ensured by the CM). The justification is supported by 
the reutilization of resources, already configured, achieved when the entire loop is 
implemented by a single configuration. When a loop does not fit in tiie XPP, the 
algorithm is applied hierarchically to the body of the loop. 

I 

Fig. 9 shows the methodology. The stifategy works around 3 levels (the cohnputa- 
tional efforts Increase from ttie first to tiie third level): 

1. Temporal Partitioning algorithm based on the estimation of the needed re- 
25 sources done with function costs based on the number and kind of opera- 
tions in the source code. The algorithm uses the HTG+ and tiie SUIF repre- 
sentation of the program; 

2. For each configuration, selected in the first level, the estimated sizes are 
30 checked with the ones estimated by generating the NML graph with the 

XPP-VC compiler. If the size surpasses the available resources, the algo- 

" Can be estimated by the number of edges, ALU nodes, REG nodes, and pre-defined values 
existant in tlie liardware grapli generated by the XPP-VC compiler 
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rithm rerun level one, relaxing the size constraint (diminulshing the maxi- 
mum number of available resources); 



3. Check if each configuration successfully checked in level 2 can be really 
5 mapped to the XPP. This level uses functions of the mapper, placer and 

router. If the configuration cannot be Implemented in the XPP, the algorithm 
returns to level 1, once more relaxing the size constraint. 

The temporal partitioning algorithm used is based on the ideas presented in [7]- 
10. The special characteristics of the algorithm to deal with resource-sharing during 
the creation of the temporal partitions have been removed and special heuristics 
have been added to deal with the fetch and configuration time of each temporal 
partition. The algorithm tries to overiap configuration and execution with fetch 
during the selection of the HTG+ nodes to each temporal partition, [describe 
15 more] 

6. Discussion * 

We call the attention of the reader for the fact that the cument methodology does 
20 not use neither the full potentialities of the XPP nor some optimizations: 

1 . The execution of a given temporal partition only starts after all the used re- 
sources have been configured; 

25 2 No pipelining between fetch and coniRguration has been used. The configu- 
ration of the XPP resources for a specific temporal partition only starts after 
its configuration words are fetched (loaded to the XPP cache); 

3. No overtapping on execution between two or more configurations; 

30 

4. The capacity of the XPP technology to configure concurrently distinct PACs 
(each PAC has its own CM); 
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5. An arbitrary order for fetching of temporal partitions conditionally quested is. 
used (the fetching order should be done by the most taken path detemrjined 
by profiling); 

5 

6. Behavioral nodes exposed in the HTG+ as concunrent nodes are not at the 
moment implemented, by the XPP-VC compiler, with parallel execution. 

Thus, we strong believe that there still be potential to Improve the perfonnance 
10 results achieved when using XPP-VC. 

7. Related Work 

The XPP technology offers an unique reconfig arable computing platform sup- 
15 ported by tools that permit to compile algorithms in C. Being a step forward in the 
context of the reconfigurable computing it pemiits to attack same of the well defi- 
ciencies presently in many, If not ail, other reconfigurable computing technologies. 
However, some of the work being done to augment the potential of such technol- 
ogy has sources in some works previously done. 

20 

Temporal partitioning has been already successfully conducted for FPGAs and 
other type of RPUs. The majority of the current approaches try to use a minimum 
number of configurations by using all the possible RPU size available for each 
temporal partition (see, for instance, [4]). Such schemes only consider another 
25 temporal partition after the current one has fulfilled the available resources and 
are insensible to the optimization that must be applied to reduce the overall exe- 
cution by overlapping the fetching, configuration and execution steps. Albeit not 
considering such optimizations, I LP formulations presented by some authors [5] 
are uncapable to deal with the complexity of many realistic examples. 

30 

One of the first attempts to reduce the configuration overhead in the context of 
temporal partitioning has been presented in [6]. However, the approach uses the 
simple model of splitting the available FPGA resources into two parts and per- 
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forming temporal partitioning using half of the total available area as the size con- 
straint The scheme only overlaps configuration with execution of adjoining parti- 
tions and does not enter into account to the pre-fetch steps that can be efficiently 
used in some RPU architectures. Furthermore, the approach can originate some 
5 problems, when some resources of the RPU must be shared by two or more par- 
titions (eliminating the requirement of disjoint spaces of the RPU used by two ad- 
jacent temporal partitions). 



[12] presents tiie scheduling of kernels (sub-tasks) targeting the Morphosys ar- 
10 chitecture. They use an efficient search pruning scheme added to an heuristic 
that pennits to consider firstly solutions which potentially conduct to the best per- 
formance results. However, they mainly orient the search to data re-use among 
the schedule kernels which is only suitable to type of reconfigurable computing 

architectures where no local memories to the RPU are available. The scheduler 

I 

15 tries to overiap computing and data transfers and minimize context reloading, 
which as we can see from the examples shown can not always conduct to the 
overall minimum latency. The scheme needs as input the application flow graph 
(without concurrency and conditional paths) and the kernel timing. The approach 
does not consider temporal partitioning and so needs that each kernel configura- 

20 tion does not exceed the context memory size. 
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Abstract 

Resource virtualization on FPGA devices, achievable due to its dynamic recon- 
figuration capabilities, provides an attractive solution to save silicon area. Archi- 

5 tectural synthesis for dynamically reconfigurable FPGA-based digital systems 
needs to consider the case of reducing the numberof temporal partitions (recon- 
figurations), by enabling sharing of some functional units in the same temporal 
partition. This paper proposes a novel algorithm for automated datapath design, 
from behavioral input descriptions (represented by a dataflow graph), which si- 

1 0 multaneously perfonns temporal partitioning and sharing of functional units. The 
proposed algorithm attempts to minimize both tiie number of temporal partitions 
and the execution latency of the generated solution. Temporal partitioning, re- 
source sharing, scheduling, and a simple form of allocation and binding are all 
integrated in a single task. The algorithm is based on heuristics and on a new 

1 5 concept of construction by gradually enlarging timing slots. Results show the effi- 
ciency and effectiveness of the algorithm when compared to existent approaches. 

1. Introduction 

20 The availability of multi-programmable logic devices (such is the case of FPGAs - 
field programmable gate arrays) with lower reconfiguration times has made pos- 
sible the concept of "virtual hardware" [1][2]: the hardware resources are sup- 
posed unlimited and implementations that oversize the resources available on the 
device are resolved by temporal partitioning. Then, the temporal partitioned solu- 

25 tion is executed by timesharing the device such that the initial functionality is pre- 
served. This concept promises to be an efficient solution to save silicon area [1]. 
One of the applications is the switch among functionalities that have mutual ex- 
clusiveness on the temporal domain, such as the context-switching between 
coding/decoding schemes in communication, video or audio systems. 

30 

Although, even the latest commercial FPGAs, such as the Xilinx™ Virtex family 
[3], do not have mechanisms to implement efficiently temporal partitioned func- 
tionalities and the time of reconfiguration of the overall FPGA is still quite high, the 
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importance of the "virtual hardware" concept has already been demonstrated with 
computationally complex applications [4]. Industrial efforts are under way to fur- 
ther improve the capability of the devices to handle multiple-configurations by 
storing several on-chip configurations and permitting the switch between contexts 
5 in few nanoseconds [5]. 

The virtualization of FPGA resources has been considered by several authors 
while dealing with circuit netlists that oversize the available resources on the de- 
vice ([6][7], just to name a few). From the point of view of the design, those ap- 

1 0 proaches work at a much low-level of abstraction, without the possibility to exploit 
tradeoffs between the number of reconfigurations and the resource sharing of 
functional units (FUs), for instance. The design automation for FPGA-based sys- 
tems should include temporal partitioning algorithms able to efficiently exploit the 
new concept. Tradeoffs among parallelism, communication costs, execution and 

15 reconfiguration times, and sharing of some FUs in the same reconfiguration need 
to be considered during the architectural synthesis phases. 

Sharing of FUs among operations is a technique to reuse a single configuration of 
an FU by more than one operation of the same type. On the other hand, temporal 

20 partitioning is a technique tailored to reuse the available resources by different 
circuits (configurations) with the time-multiplex of the device. The nodes of a 
given intermediate representation (e.g., a dataflow graph) representing operations 
have to be scheduled in time steps to be executed in each temporal partition (TP). 
Temporal partitioning must preserve the dependencies among nodes (that are 

25 already temporal dependencies) such that a node B dependent an node A cannot 
be mapped to a partition executed before the partition where node A is mapped. 
In addition, considering sharing FUs during temporal partitioning can conduct to 
better overall results (lower number of TPs and better peribrmance). 

30 Figure 1 a) shows a design flow which integrates temporal partitioning prior to the 
high-level synthesis tasks [8]. The majority, if not all. of the existent approaches 
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Utilizes. the presented flow [93[10]. Our efforts address architectural synthesis 
integrating temporal partitioning and this paper presents a new temporal parti- 
tioning algorithm that effectively takes into account sharing of FUs, while main- 
taining a small computational complexity. Besides, it is sufficiently flexible to tar- 
5 get different FPGA devices. Figure 1b shows the design flow proposed in this pa- 
per, where temporal partitioning is integrated in the high-level synthesis tasks and 
is performed simultaneously. 
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Figure 1: Design flow based on higli-level syntiiesis for reconfigurable computing 
systems: a) traditional flow; b) proposed flow. 

30 Example 1: Motivational example 



There is no distinction among the temns: high-level synthesis, architectural synthesis and be- 
havioral synthesis. 
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Consider the dataflow graph exhibited in Figure 2 (Exi). It consists of 4 additions 
and 2 multiplications. Suppose that each adder uses 1 cell and has a latency of 1 
clock cycle, each multiplier uses 2 cells and has a latency of 2 clock cycles and 
the maximum resources available on the device equals 3 cells. The dataflow 
graph has a critical path latency of 4 cycles and needs 8 cells given those FUs 
(last row of Table 1). Figure 2 shows an optimal solution (not considering the area 
of multiplexers, registers and control unit needed to implement sharing of a spe- 
cific FU) for the example with results shown in the second row of Table 1 . In Fig- 
ure 2 each gray region identifies operations that are mapped to the same FU. The 
optimal solution is achieved with only one adder and one multiplier and fits totally 
on a single TP. When not considering sharing of adders, the optimum result is 
shown in the third row of Table 1 . The algorithm proposed in this paper achieves 
those optimal results. The fourth row of the table shows the solution obtained 
when considering a leveling temporal partitioning algorithm that does not consider 
resource sharing of FUs. From this example, it can be seen tiiat resource sharing 
can reduce the number of reconfigurations and can also reduce the overall exe- 
cution latency. There are also cases where the critical path latency of the input 
dataflow graph (last row) is maintained (second row). 




Figure 2: Dataflow graph of the example Ex1 . 
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Table I. Results for Ex1. 



Approach 


(+.*) 


3SnPs 


Execution 
Latency 


Resources 
used 


Optimum (sharing of ad- 
ders and multipliers) 




1 


4 


3 


Optimum (sharing of 
multipliers) 




3 


5 


3 


ASAP (no sharing) [11] 


(4, 2) 


4 


6 


3 


Without Temporal Parti- 
tioning (no sharing) 


(4.2) 




4 


8 



The remainder of this paper is organized as follows. Section 2 fomriulates and ex- 
plains the problem. The algorithm is deeply explained in section 3, where the 
pseudo-code and the overall performed steps are fully elucidated through an exam- 
10 pie. In section 4 experimental results are shown and discussed. In section 5, related 
work is described. Finally, in section 6, conclusions are presented and further work 
is envisaged. 

2. Problem Definition 

15 

Given a dataflow graph (DFG), representing a behavioral description, G = V, E). 
topologically ordered, directed and acyclic, with |V| nodes, {vi,V2,...,V|v|} and |E| 
edges, where each node v, represents an operation and each edge ey e E repre- 
sents a dependence between nodes Vj and vj. A dependence can be a simple 
20 precedence-dependence or a transport-dependence due to the transport of data 

between two nodes. The DFG can be obtained fi^om an algorithmic Input description. 
Such pre-processing step is beyond the scope of this article, but the front-end of our 
Java compiler for reconfigurable computing systems can be employed [12]. 
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Here we assume that there is a component library with a set of FUs and there is 
one FU for each type of operation in the DFG. O represents the set of FUs, from 
the component library, to be instantiated by the algorithm. R^iax represents the 
5 resource capacity available on the device, R(Trj) returns the number of resources 
utilized by the TP iTj and R(Vi) returns the number of resources utilized by the FU 
instance associated with Vj, N(Tri) retums a subset of nodes of V mapped to ttl 

Each partition it\ is a non-empty subset of V, where for each node exists a map to 
1 0 one and only one FU instance in 0. Tr(vi) identifies the TP where node V| is 
mapped. The set of the TPs is represented by: 



M (3) 



1 5 where N represents the number of TPs. A graph G, temporal partitioned in N sub- 
sets (TPs), is correct if: 

N 

C\N(7ri) = 0 ; each node vi e V is mapped to only one TP 
(here we do not consider cloning of operations in the DFG); 



20 



25 



N 

\Jn(ju,) = V: a\\ the nodes of V are mapped; 
/=i 

V TTj e p,R(7c,) < : each TP fits in the resources available on the device; 

V eg € E, 7i(Vi ) > Tc(Vj ) : the order of the execution of the TPs does not violate 

the dependencies among operations of the DFG (necessary condition, to obtair 
the same functionality). 
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A correct set of TPs guarantees the same overall behavior of the original graph 
(when executed from 1 to N and considering a correct communication mechanism 
to transfer data among TPs). However, we are also interested on the minimization 
of the overall execution latency. The cost that reflects the overall execution la- 
5 tency in a time-multiplexed device can be estimated by the equation (1) or (2), 
when partial or full reconfiguration of the available resources is considered re- 
spectively. CS(p) returns the minimum execution latency (number of control 
steps or clock cycles) of the partitioned solution, CS(7ti) refers to the minimum 
execution latency of the TP iti (it may include the communication costs and repre- 
1 0 sents the execution latency of the critical path of the graph formed by the subset 
of nodes in m and the correspondent edges, considering that nodes sharing FU 
instances can exist). Stand 5 represent the number of clock cycles to reconfigure 
the TP m or all the available resources respectively. 



15 



CS(p) = J;cS(7Ei) + 5i 

i=1 



(1) 



cs(i&)=2cs(7i,)+vxa 



(2) 



20 The objective of our algorithm is to furnish a set of datapaths that will be executed 
in sequence with a minimum number of control steps'"®. Each datapath unit fits on 
the physically available resources. For the sake of minimizing the number of TPs 
needed, exploiting sharing of FUs while doing temporal partitioning needs to be 
considered by the algorithm. Specifically, our algorithm has to output: 



25 



The Set of TPs (p): each TP identifying the nodes of the DFG assigned to 

it: 



We assume that each control/time step for scheduling is equal to the clock period of the sys- 
tem. Thus, there is no distinction among the use of clock cycle, control step or time step. 
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Each node of the DFG has to identify a specific FU instance of <1> imple- 
menting the operation. 

5 

From those outputs, it is straightfonward to generate a behavioral HDL-RTL 
(hardware description language at the register transfer level) description of each 
TP control unit and a structural HDL-RTL description of each datapath, consider- 
ing the existence of a HDL description for each FU. The configurations can be 
10 generated from those netlists using a traditional FPGA design flow. 

3. Algorithm Simultaneously Exploiting Temporal Partitioning and Sharing 
of FUs 

15 The algorithm uses an initial number of TPs that can be specified by the user. 

Another possibility is to use the number of levels of the DFG or the number of TPs 
utilized by any temporal partitioning algorithm without using sharing of FUs (e.g., 
ASAP [1 1]) as the initial number of TPs. The user has to specify the total number 
of available resources on the device, in addition, for each FU there exists a boo- 

20 lean variable which value indicates if the FU can be shared or not (sharing of 

some FUs may need more resources than the utilization of several FU instances, 
due to the overhead of using auxiliary circuits needed for the implementation of 
the sharing mechanism). 

25 To a clear description, we show the main steps of the algorithm with a connection 
to Examplel. A brief exposition of the steps performed, when considering sharing 
of all FUs is stretched in Figure 3. 



30 
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Ncde 

0 1 2 3 4 5 
ASAP 0 0 10 2 3 
ALAP 1 1 2 0 2 3 




a) 

(x)3 Tim 


( 1*} * 


1 


c) 


















(4) s 


d) 


s) 










(?}4(X) 2 


«i @ * (J) * . ^ 








g) 



Figure 3. Algorithm execution tlirough an example: a) ASAP and ALAP start 
times; b) Tlie nodes in the critical path identified by the gray region;c), d), e), f) 
25 and g) show iterations of the algorithm. 



The algorithm stariB with the following steps: 
30 1 . Compute the set of nodes child^° of each node of the DFG; 

2. Map an FU instance to each operation in the DFG (at the moment neither" 



A node V| is child of a node Vj if there exists a path fcom Vj to the end of the DFG that includes V| 
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consider more than one FU for the same operation nor FUs capable to imple- 
ment more than one operation); 



3. Estimate the area and execution latency of each node in the DFG according to 
5. the FU characterization, existent in the component library, for the target device. 
This step is beyond the scope of this article and from now on we will assume 
that there exists, for each FU, an estimation of the number of resources and of 
the execution latency; 

10 4. Perform the ASAP (as soon as possible) and ALAP (as late as possible) start 
times for each node in the DFG (See Figure 3a)), both unconstrained. When 
doing the ALAP scheme, the algorithm also calculates the ALAP level of each 
node; 

15 5. Determine the set of nodes in one of the critical paths of the DFG (see Figure 
3b)); 

6. Create a number of TPs equal to the input number specified (see the three TPs 
initially created in Figure 3c)); 

20 

7. Assign each node of the set of nodes in one of the critical paths of the DFG 
(determined in point 5) to a TP by ascending level. When the number of TPs is 
larger than the number of nodes in the critical path, the last TPs are left empty; 
otherwise the last nodes of the set are left unassigned (see the nodes assigned 

25 to each TP in Figure 3c)); 

8. Assign the size (number of resources used) of a node in a TP to the cunrent 
size of that TP (see Figure 3c)). 

30 After the above steps, the main kernel of the algorithm is executed (see the 
pseudo-code in Figure 4, Figure 5 and Figure 6), Some of the most important 
functions used by the algorithm are listed and briefly explained below: 
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- Vj.ALAPieveiO returns the level of V| considering an ALAP leveling scheme; 

- V|.ALAPStart(): returns the ALAP start time of vi; 
5 - 7Ci.addEi(Vi): adds the node Vi to tci; 

- 7C|.rmEI(Vi): removes Vi from nu 

- 7Ci.sched(vi): returns the number of control steps of the critical path considering 
10 that Vi is mapped to nu 

- ^ .add(7ri): adds a new TP n\ to the current set of TPs (tii will be the last TP in the 
set); 

15 - p.elAt(i): returns the i**" TP from the Set of TPs (p); 

- findNodes(i): returns a list of nodes ready to be mapped to the i^ TP. 

Our algorithm will be progressively constructing a global solution. On each iteration, 
20 the algorithm traverses the sequence of the existent TPs trying to assign ready 

nodes to each TP. Each TP has an associated maximum slot time (MAXcs). A node 
ready to be mapped to a TP is only really considered for mapping if the resultant 
execution latency of that TP (considering the mapping) does not exceed the corre- 
spondent MAXcs (line 15 of Figure 4 and lines 2, 21 and 29 of Figure 5). MAXcs of a 
25 given TPtii is equal to the critical path latency of that TP added by a relax amount: 
CS(7t|) + relax. On each iteration over the TPs the relax value is incremented by the 
great common divisor (gcd) among all the execution latencies of the operations in 
the DFG (line 24 of Figure 4). When a node is mapped (see function mapNode in 
Figure 6), the critical path length of the associated TP is updated (lines 4 and 5 of 
30 Figure 6). 

The algorithm considers that nodes in contiguous time steps mapped to the same 
TP and with the same operation should be bound to the same FU instance. 

35 A list of nodes ready for mapping to a current TP is used. The list has the nodes 

sorted by increasing ALAP start times (the candidate operation having the least 

ALAP value will have the highest priority) and, for nodes with the same ALAP start 

time, it uses the ASAP Start time as a tiebreak (by ascending or descending order). 

The list is determined examining for a given node its predecessors (they 
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. already must be mapped in TPs before the current TP) and the child set (the 
nodes child of the node to be mapped must be on TPs after the TP under consid- 
eration). The incremental update of the list of the nodes candidate to be mapped 
to the current TP, when each node is mapped, is an option of the algorithm (lines 
5 6 and 7 in Figure 6). When such option is disabled the algorithm only tries to do 
update when the list is empty. The algorithm uses a static-based approach in the 
sense that the ALAP/ASAP values are calculated only once and they are no more 
time updated. 

10 1. // begin main kernel 

2. BitSet NodesSched = marked with the nodes al- 
ready mapped to TPs; 

3. int NumTP = 0; relax =0; 

4- int step == gcd(All nodes in DFG) ; 
15 5. while (notAllNodesSched (NodesSched) ) { 



6. LOOP B: while (SchedNum < maxPartitions ) { 

7. Vector listReady = findNodes (NumTP) ; 
8 , Vector tti = p . elAt (NumTP) ; 

9. while (! listReady. isEmptyO ) { 

20 10. Node V]c = listReady . rmFirst () ; 

11- int Rnew = R(7ri) + R(Vk); 

12.. Boolean fit = (Rnew <= Rmax) ; 

13. // CS(7ri) when is mapped to n±: 

14. int CSnew = Tii- sched (Vk) ; 

25 15. if ((CSnew > (CS (7ii) 4-relax) ) && (tti is 

the last TP) && (N(7i;i) ~ 0) || 

(Vk.ALAPievelO < 7t(Vk)))){ 

16. tryToSched ( Rnew f Vk, fit, CSnew - 

CS (TCi) , Til, NodesSched, update, CSn< 
30 ListReady) ; Figure 5 

17. } else { 

18. tryToSched( Rnew / Vk, fit, relax, Tii, 
NodesSched, update, CSnew/ Lis- 
tReady) ; Figure 5 

35 19. } 

20. } 

21. NumTP++; 

22. } 

23. NumTP = 0; 

40 24. relax += step; 



25. } 

26- // end main kernel 
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Figure 4. Main kernel of the proposed algorithm. 



1. tryToSched {±nt Rnew/ Node vi. Boolean fit, int 
5 relax, TP 7tk, BitSet NodesSched, Boolean up- 
date, int CSnew ListReady) { 

2. if ( (CSnew ^ (relax+CS(7ik) ) I I (CS (TCk) == 0)) 
{ 

3. if only one node Vj in { 

10 4. if (Vj.ALAPStart 0 > vi . ALAPStart ( ) ) { 

5. if ((Knew - R(Vj)) <= Rmax) { 

6. TCjc, rmEl (Vj) ; 

7. R(7Cjc) = R(Vi) ; 
8 . Ttk. addEl (vi) ; 

15 9. NodesSched. clear (vj) ; 

10 . NodesSched. set (vi) ; 
continue LOOP B; 

11. } 

12. } 
20 13. } 

14 • Boolean canShare = try sharing with a 

node of the same type with a path of 
shared Fus with the smallest length 
(number of nodes; 
25 15. if (canShare) { 

16. if (fit SlSl share produces increase) { 

17. canShare = false; 

18. rmShare (vi) ; 

19. } else { 

30 20. int CSnewi ~ TCk- sched (Vi) ; 

21. if(CSnewi < . (relax + CS (Ttk) ) ) { 

22. mapNode (ni^, Vi, NodesSched, up- 
date, CSnewi/ ListReady) ; Figure 6 

23. } else { 

35 24. rmShare(Vi); 

25. canShare = false; 

26. } 

27. } 

28. } 

40 29. if (! canShare && fit && (CSnew ^ (relax+ 

CS(7Ck) )) I I (CS{7ik) == 0) ) { 

30. mapNode (Tik/ Vi, NodesSched, update, 
CSnew/ ListReady) ; Figure 6 

31. } 
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32. if (Vi not mapped and no FU with opera- 
tion type of Vi in thisTP and vi does not 
fit and thisTP is the last TP) { 

33. create a new TP 7Cn; 
5 34- ^.add(7tn); 

35. mapNode [Tin r vi, NodesSched, update^ 

CS (TCn) / ListReady) ; Figure 6 
36- break LOOP B; 

37. } 



10 38. } . 

39. } // end tryToSched 



Figure 5: Function tryToSched. 

15 



1. mapNode(TP 71^, Node vi, BitSet NodesSched, 

Boolean update, int CSnewf Vector ListReady) { 



2. TCjc.addEl (vi) ; 

20 3. NodesSched. set (Vi) ; 

4. if (CSnew > CS (TTk) ) 

5. CS(7rk) = CSnew; 

6. if (update) 

7. upDateAndSortALAP (ListReadv, Vi) ; 



25 8- } // end mapNode 
Figure 6: Function nnapNode. 
4.1 Sharing versus not sharing 

30 

Table III shows results for the considered examples. Our* and Our** identify re- 
sults obtained by applying the proposed algorithm. Our* considers resource 
sharing for both adder and multiplier units, and Our** only considers resource 
sharing for multiplier units. #cs identifies the execution latency (number of clock 
35 cycles) and #p the number of TPs. Each solution related to our algorithm was 
obtained in less than 1 s of CPU time. 
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Table III: Results obtained for the examples. 



Example 


Rmax 


Approach 


ASAP 


SA 


Our* 


Our** 


#D 


#cs 




#cs 






*P 




Ex1 


6 








A 


I 




2. 


4 


SEHWA 


6 


1 0 


OO 


J o 


OO 


1 


Oil 


6 


34 


10 


Q 
9 




Q 


iy 


1 


18 


6 


18 


15 


O 


To 


o 


15 


1 


15 


5 


15 


HAL 


6 


o 


1 1 


o 


9 


2 


10 


3 


10 


10 


o 


lU 


o 


7 


2 


7 


3 


7 


ESNF 


6 


12 




12 


23 


1 


23 


9 


25 


10 


o 




o 


22 


5 


18 


7 


18 


15 


4 


1 C7 




1 o 


1 


17 
1 / 




1 o 


FIR 


6 


14 


28 


14 


27 


1 


20 


4 


27 


10 


7 


16 


7 


15 


1 • 


15 


4 


15 


15 


5 


12 


5 


11 


1 


11 


3 


11 


MAT4x4 


6 


72 


136 


72 


134 


1 


130 


21 


130 


10 


37 


69 


37 


69 


1 


66 


17 


66 


15 


25 


47 


25 


46 


1 


46 


10 


46 


20 


16 


29 


16 


29 


2 


29 


4 


29 



The SA results were obtained with a simulated annealing version to do temporal 
partitioning without resource sharing proposed in [16]. Here, the algorithm is 
tuned to optimizing the overall execution time (the algorithm can also exploit the 
tradeoff between execution time and communication costs). The ASAP results 
refer to the leveling technique proposed in [11]. 
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Only Mat4x4 needed to start with the number of TPs obtained by the ASAP ap- 
proach to achieve the best solution. For ail the other examples, the best solution 
was obtained starting with an initial number of TPs equal to the number of levels 
of the DFG. The results for Mat4x4 in Table III were collected disabling the update 
5 of the list of nodes ready for each node mapped (the list is updated only when It is 
empty). It is strongly recommended to disable the update option for examples with 
high-level degree of parallelism and a small critical path length. 

The values in bold in the 6* and 8^ columns of Table III show the minimum exe- 
1 0 cution latency for the datapaths obtained by the considered approaches (not con- 
sidering configuration times). The values in bold in the 10*^ column represent that, 
even without considering sharing of adders, our algorithm returns solutions with 
execution latencies equal to the execution latencies obtained sharing all the re- 
sources (8^^ column), despite the fact that those solutions need more TPs. 

15 

When considering resource sharing for all FUs, a minimum number of TPs (only 4 
cases of Table III needed more than one TP to produce a minimum execution 
time) seems to ensure solutions with lower execution latencies than the obtained 
by doing temporal partitioning with ASAP or SA for the majority of the examples 

20 (only one case is not as good as SA). Note that when all the FUs can be shared 
and the resource overhead to implement sharing is not taken into account, an 
empirical observation tell us that the solutions with lower execution latency are 
those with only one TP. This is expected by the fact that a new TP produces an 
equal or worse effect than sharing FU instances on the overall execution latencies 

25 because all the nodes in that TP can only start executing after the end of the exe- 
cution of the TP immediately before. 

When sharing of adders is not considered the algorithm is capable to find 13 so- 
lutions without inferior execution latency. 



81 

SUBSTITUTE SHEET (RULE 26) 



wo 03/071418 



PCT/EP03/00624 



4.2 Exploiting the number of TPs 

An exploitation of the overall execution latency versus the number of TPs is 
5 shown in Figure 8. Those results were produced by calling the algorithm several 
times, each time starting with a different initial number of TPs from a range of 1 to 
15. The exploitation has been done in approximately 5.4s of CPU time. All the 
solutions use only a single TP and the best result (execution latency equal to 66 
clock cycles) has been achieved when the algorithm started with 8 TPs. The re- 
10- suits without considering sharing of adders are shown in Figure 9. The algorithm 
exploited a range of TPs from 1 to 26 and the minimum execution latency 
achieved was 66 clock cycles (solution with 21 TPs). Based on tiiose results we 
can select a solution that minimizes the global execution latency taking into ac- 
count the reconfiguration times (see equation (2)). 




Initial number of Temporai Partitions 

25 Figure 8: Execution latency versus the initial number of TPs for I\/Iult4x4 obtained 
by the proposed algorithni, when Rmax =10 (sharing of adders and multipliers). 

From the results presented so far we may conclude that sharing FUs can reduce 
30 the number of TPs without increasing the overall execution time. Moreover, a 
minimum number of TPs can be a priority, when an FPGA with significant recon- 
figuration times is used. Due to its low computational complexity, tiie algorithm 
can be used to exploit the design space based on the tradeoff between the num- 
ber of TPs and the overall execution latency. 
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Figure 9: Execution latency and the final number of TPs versus the initial number 
of TPs obtained by the algorithm for Mult4x4, when Rr^ax = 10 (no sharing of add- 
ers). 

4.3 Comparison with other schedulers 



20 



At this point a question may occur: is the algorithm competitive when a single TP 
is envisaged? Table IV shows results for EWF and SEWHA, considering various 
sizes for the available resources (Rwiax)- 



The schedules obtained by the proposed algorithm considering only one TP are 
shown (see the 5**' column). The number of resources used for each type of FU 

25 for each solution is also shown (last column). "Fixed" refers to results collected 
from the state-of-the-art schedulers [17][18][19] and represent optimal (identified 
with *) or near-optimal scheduling results (without enter into account with tempo- 
ral partitioning) for the specified constraint on the number of FUs for each type of 
operation (see the 2"** column). The results show that our algorithm is efficient, 

30 even when we are interested on a final solution with a single TP. 

The result labeled with a is achieved without an incremental update of the list 
of the nodes ready to be mapped. This result shows that the algorithm did not 
skip from a local minimum, since at least the result related to Rmax =15 should be 
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achieved. The first 4 results obtained for SEHWA consider the increasing order of 
the ASAP values as the second key (there is no evidence to suggest when it is 
better to use the decreasing or the increasing ASAP values as the second key). 

5 The number of each FU Instance allocated by our algorithm for each Rmax con- 
straint only was different in two cases to the constraints used (with total number of 
resources equal to Rwiax) to produce the near-optimal scheduling results (see Ta- 
ble IV). Therefore, it seems that our algorithm can also be used to a fast identifi- 
cation of the number of FU instances needed, considering a specific number of 
10 maximum resources available on the device. 



Table IV: Comparison of scheduling results obtained for EWF and SEHWA. 





Approach 


Example 


Fixed [17] [18] [19] 


Our 




constraints 

• (X.+) 


#cs 


constraints 
Rmax(#P=1) 


#cs 


(X.+) 




(1,2) 


21« 


6 


23 


(1.2) 




(1.3) 


18 


7 


22 


(1.3) 




(2.1) 


21« 


9 


22 


(1.5) 


EWF 


(2. 2) 


18« 


10 


20 


(2.2) 




(2. 3) 


18 


11 


18 


(2. 3) 




(3.2) 


18 


14 


18 


(2.6) 




(3.3) 


17* 


15 


17 


(3.3) 




(1.1) 


34* 


5 


34 


(1.1) 


1 (2,1) 


18« 


9 


20 


(2,1) 




(2.2) 


18* 


10 


18 


(2. 2) 


SEHWA 


(3.1) 


16* 


13 


17 


(3. 1) 




(3. 2) 


15* 


14 


15 


(3.2) 




(3. 3) 


15* 


15 


15 


(3.3) 




(4, 1) 


16* 


17 


16* 


(4,1) 




(4. 2) 


11* 


18 


11 


(4. 2) 
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5. Related Work 

5 As far as we know, the development of temporal partitioning algorithms was firstly 
considered in [9][2]. The similarities of both scheduling on high-level synthesis [8] 
and temporal partitioning allow the use of common scheduling schemes for parti- 
tioning. Some authors, such as [9][10], have considered temporal partitioning at 
behavioral levels having in mind the integration of synthesis. 

10" 

In [9], a heuristic based on a static list scheduling algorithm, enhanced to consider 
temporal partitioning and partial reconfiguration, is shown. The approach exploits 
the dynamic reconfiguration capability of the devices, while doing temporal parti- 
tioning. 

15 

In [10][20] the temporal partitioning problem is modeled in a specified 0-1 
non-linear programming (NLP) model. The problem is transformed to integer lin- 
ear programming (ILP) and the solution determined by an ILP solver. Due to the 
long execution times, this approach is not practical for large input examples. 

20 Some heuristic methods have been developed to permit its usability on larger in- 
put examples [21]. Kaul [22] exploits the loop fission technique jvhile doing tem- 
poral partitioning in the presence of loops to minimize the overall latency by utili- 
zation of the active TP as long as possible. Sharing of functional units is consid- 
ered inside tasks and temporal partitioning is performed at the task level. Design 

25 space exploitation is perfonned by inputting to the temporal partitioning algorithm 
different design solutions for each task. Such solutions are generated by a 
high-level synthesis tool (constraining the number of FUs of each type). This ap- 
proach lacks a global view and is time-consuming. 

30 The simplest approaches only consider temporal partitioning without exploiting 
sharing of FUs. In [11], both a temporal partitioning algorithm based on leveling 
the operations by an ASAP scheme and other based on clustering a number of 
nodes are used. The algorithm fills the available resources in the increasing order 
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of the ASAP levels. The selection of nodes in the same level is arbitrary and the 
algorithm switches to another TP when it encounters the first node that does not 
fit on the current TP. The approach does not consider neither communications 
costs nor resource sharing. In [23] another algorithm is presented that selects the 
5 nodes to be mapped in a TP with two different approaches (one for satisfying 
parallelism and another for decreasing communication costs). In [12], an algo- 
rithm based on the extension of the ASAP or ALAR leveling schemes resorting to 
the mobility of each node to select among the nodes has been considered. [12] 
also shows an algorithm that searches recursively in the list of ready nodes so 
1 0 that if a node cannot be mapped to the current partition, other nodes can be con- 
sidered. 

[16] considers both communication costs among different TPs that can occur and 
the overall execution time. The authors presented an extension to static list 

15 scheduling, which permits to the algorithm sensitivity to the communication costs 
while trying to minimize the overall execution time. The results presented, when 
compared to near-optimal solutions obtained with a simulated annealing algorithm 
tuned to do temporal partitioning while minimizing an objective function, that inte- 
grates the execution time of the TPs and the communication costs, revealed the 

20 efficiency of the approach. 

[24] presents a method to do temporal partitioning considering pipelining of the 
reconfiguration and execution stages. The approach divides an FPGA into two 
portions to overlap the execution of a TP in one portion (previously reconfigured) 
25 with the reconfiguration of the other portion. 

in [25] constraint logic programming is used to solve temporal partitioning, sched- 
uling, and dynamic module allocation. However, the approach needs a specifica- 
tion of the number of each FU before processing and may suffer of long runtimes. 

30 

More related to our approach is the algorithm presented in [26]. A scheme based 
on the force-directed list scheduling algorithm that considers resource sharing 
and temporal partitioning is shown. The algorithm tries to minimize the overall 
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execution time, performing a tradeoff between the number of TPs and sharing of 
FUs. However, the approach adapted a scheduling algorithm not originally tai- 
lored to do temporal partitioning and lacks of a global view. Instead, our approach 
proposes a novel algorithm matched to the combination of temporal partitioning 
5 and sharing of FUs that maintains a global view. 

6. Conclusions and Future Work 

In this paper we have presented a new and useful algorithm combining temporal 
10 partitioning, sharing of functional units, scheduling, allocation and binding. Unlike 
other approaches, this algorithm merges those tasks in a combined and global 
method. The obtained results, from a number of benchmarks, strongly confirm the 
efficiency and effectiveness of the idea. 

15 The low computation time achieved, when dealing with the presented examples, 
shows that the algorithm is fast and efficient and thus can be used on large ex- 
amples. 

The inclusion of functional units with pipeline stages and the consideration of 
20 more than one implementation for a given operation will be considered in a near 
future. Another important issue is the overiapping of reconfiguration and execu- 
tion that should be considered by future enhancements. Finally, aspects related to 
conditional paths and loops will also need to be focused of future work. 
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1. introduction 

This document describes a method for compiling a subset of a high-level pro- 
gramming language (HLL) like C or FORTRAN, extended by port access flinc- 
5 tions, to a reconfigurable data-flow processor (RDFP) as described in Section 3. 1, 
e., the program is transfonned to one or several configurations of the RDFP. 

This method can be used as part of an extended compiler for a hybrid architec- 
ture consisting of standard host processor and a reconfigurable data-flow coproc- 
10 essor. The extended compiler handles a full HLL like standard ANSI C. It maps 
suitable program parts like inner loops to the coprocessor and the rest of the pro- 
gram to the host processor. However, this extended compiler is not subject of this 
document. 

15 2. Compilation Flow 

This section briefly describes the phases of the compilation method. 

2.1 Frontend 

20 

The compiler uses a standard frontend which translates the input program (e. g. a 
C program, into an internal format (IF) consisting of an abstract syntax tree (AST) 
and symbol tables. The frontend also performs well-known compiler optimizations 
as constant propagation, dead code elimination, common subexpression elimina- 
25 tlon etc. For details, refer to any compiler construction textbook like [1]. E. g., the 
SUIF compiler [2] can be used for this purpose. 

2.2 Temporal Partitioning 

30 Next, the program's IF representation is partitioned into sections which are exe- 
cuted sequentially on the RDFP by separate configurations. If the entire program 
can be executed by one configuration, (fitting on the given RDFP), no temporal 
partitioning is necessary. This phase generates reconfiguration statements which 
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load and remove the configurations sequentially according to the original pro- 
gram's control flow. 

2.3 Configuration Generation 

5 

Finally, the program sections detemiined by the temporal partitioning are mapped 
to RDFP configurations. This phase generates a program code or data structure 
which is then used to directly program the RDFP. 

10 3. Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of a RDFP. A 
possible implementation of the RDFP architecture is a PACT XPP™ Core. Here 
we only describe the minimum requirements for a RDFP for this compilation 
15 method to work. The only data types considered are multi-bit words called data 
and single-bit control signals called events. Data and events are always proc- 
essed as packets, cf. Section 3.2. 

3.1 Configurable Objects and Functions 

20 

An RDFP consists of an array of configurable objects and a communication net- 
work. Each object can be configured to perform certain functions (listed below). It 
performs the same function repeatedly until the configuration is changed. The 
' array needs not be completely uniform, i, e. not all objects need to be able to 
25 perform all functions, E. g., a RAM function can be implemented by a specialized 
RAM object which cannot perfomi any other functions. It is also possible to com- 
bine several objects to a "macro" to realize certain functions. Several RAM ob- 
jects can, e. g. be combined to realize a RAM function with larger storage. 

30 After a configuration has been removed, all information is lost. Only the contents 
(values) of a RAM are preserved during reconfiguration. 
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Figure 1: Functions of an RDFP 

The following functions mainly handling data packets can be configured in an 
25 RDFP. See Fig. 1 for a graphical representation. 



30 



ALU[opcode]: ALUs perform common arithmetical and logical operations on 
data. ALU functions ("opcodes") must be available for all operations used in 
the HLL^\ ALU functions have two data inputs A and B, and one data output 
X. Comparators have an event output U instead of the date output They pro- 
duce a 1 -event if the comparison Is true, and a 0-event otherwise. 



" otherwise programs containing operations wiilcli do not liave ALU opcodes in the RDFP must 
be excluded from the supported HLL subset or sut)stituted t}y "macros" of existing functions. 
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• CNT: A counter function which has data inputs LB, UB and INC (lower bound, 
upper bound and increment) and data output X (counter value). A packet at 
event input START starts the counter, and event input NEXT causes the gen- 

5 eration of the next output value (and output events) or causes the counter to 
terminate if UB is reached. If NEXT is not connected, the counter counts con- 
tinuously. The output events U, V, and W have the following functionality: For 
a counter counting N times, N-1 event packets with value 0 (0-events) and 
one event packet with value 1 (1 -event) are generated at output U. At output 
10 V, N 0-events are generated, and at output W, N 0-events and one 1 -event, 
are created. The 1 -event at W is only created after the counter has termi- 
nated, i. e. a NEXT event packet was received after the last data packet was 
output. 

15 • RAM[size]: The RAM function stores a fixed number of data words ("size"). It 
has a data input RD and a data output OUT for reading at address ERD. 
Event output ERD signals completion of the read access. For a write access, 
data inputs WR and IN (address and value) and data output OUT is used. 
Event output EWR signals completion of the write access. ERD and EWR al- 

20 ways generate 0-events. Note that external RAM can be handled as RAM 

functions exactly like internal RAM. 

• GATE: A GATE synchronizes a data packet at input A back and an event 
packet at input E. When both have arrived, they are both inputs consumed. 

25 The data packet is copied to output X, and the event packet to ou^ut U. 

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a 
data output X. If SEL receives a 0-packet, input A is copied to output X and 
input B discarded. For a 1-packet, B Is copied and A discarded. 

30 

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, 
and a data output X. If SEL receives a 0-packet, input A is copied to output X, 
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but input B is not discarded. The packet is left at the input B instead. For a 1- 
packet, B is copied and A left at the input. 

• DEMUX. A DEMUX function has one data input A, an event Input SEL, and 
5 two data outputs X and Y. If SEL receives a 0-packet, input A is copied to 

output X, and no packet is created at output Y. For a 1 -packet, A Is copied to 
Y, and no packet is created at output X. 

• IVIDATA: A MDATA function multipllcates data packets. It has a data input A, 
10 an event input SEL, and a data output X. If SEL receives a 1 -packet, a data 

packet at A is consumed and copied to output X. For all subsequent 0- 
packets at SEL, a copy of the Input data packet Is produced at the output 
without consuming new packets at A. Only if another 1 -packet anives at SEL, 
the next data packet at A Is consumed and copied^. 

15 

• INPORT[nam.e]: Receives data packets from outside the RDFP through input 
port "name" and copies them to data output X. If a packet was received, a 0- 
event Is produced at event output U, too. (Note that this function can only be 
configured at special objects connected to external busses.) 

20 

• OUTPORT[name]: Sends data packets received at data input A to the outside 
of the RDFP through output port "name". If a packet was sent, a 0-event Is 
produced at event output U, too. (Note that this function can only be config- 
ured at special objects connected to external busses.) 

25 

Additionally, the following functions manipulate only event packete: 

• 0-FILTER. 1 -FILTER: A FILTER has an input E and an output U. A 0-FILTER 
copies a 0-event from E to U, but 1 -EVENTS at E are discarded. A 1 -FILTER 

30 copies 1 -events and discards 0-events. 

• INVERTER: Copies ail events from input E to output U but Inverts its value. 
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10 



0-CONSTANT, 1 -CONSTANT: 0-CONSTANT copies all events from input E 
to output U, but changes them all to value 0. 1 -CONSTANT changes all to 
value 1 . 

ECOMB: Combines two or more inputs E1, E2, E3... producing a packet at 
output U, The output is a 1 -event if one or more of the input packets are 1- 
events (logical or). A packet must be available at all inputs before an output 
packet is produced^^. 



• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at 
its output U. If it has an input START, one entire sequence is generated for 
each event packet arriving at U. The sequence is only repeated if the next 
event arrives at U. However, if START is not connected, ESEQ constantly re- 
1 5 peats the sequence. 

3.2 Packet-based Communication Networl< 

The communication network of an RDFP can connect an outputs of one object (i. 

20 e. its respective function) to the input(s) of one or several other objects. This is 
usually achieved by busses and switches. By placing the functions properly on 
the objects, many functions can be connected arbitrarily up to a limit imposed by 
the device size. As mentioned above, all values are communicated as packets. A 
separate communication network exists for data and event packets. The packets 

25 synchronize the functions in a data-flow lashion. I.e., the function only executes 
when all input packets are available (apart fjrom the exceptions where not all in- 
puts are required as described above). The function also stalls if the last output 
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP 
self-synchronizes its execution without the need for external control. Only if two or 

30 more function outputs are connected to the same function input (N to 1 connec- 
tion), the self-synchronization is disabled. The use has to ensure that only one 

^ Note that this can be implemented by a MERGE with special properties on XPP, 
^ Note that this function is implemented by the EAND operator on the XPP. 
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packet arrives at a time. Otlienyvise a packet might get lost, and tlie value result- 
ing from combining two or more packets is undefined. Tlierefore tliis sliould be 
avoided. However, a function output can be connected to many function inputs (1 
to N connection) without problems. 

5 

There are some special cases: 



• A function input can be preloaded with a distinct value during configuration. 
This packet is consumed like a nomnal packets coming from another object. 

10 

• A function input can be defined as constant. In this case, the packet at the 
input is reproduced repeatedly for each function execution. It is even possible 
to connect an output of another function to a constant input In this case, the 
constant value is changed as soon as a new packet arrives at the input Note 

1 5 that there is no self-synchronization In this case, too. The function is not 

stalled until the new packet an-ives since the old packet is still used and re- 
produced. 

An RDFP requires register delays in the dataflow. Otherwise very long combina- 
20 tional delays and asynchronous feedback is possible. We assume that delays are 
inserted at the inputs of some functions (like for most ALUs) and in some routing 
segments of the communication network. 

4. Temporal Partitioning 

25 

The details of Temporal Partitioning need to be inserted from J Cardoso's docu- 
ments. 

5. Configuration Generation 

30 

5.1 Language Definition 
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The following HLL features are not supported by the method described here: 

• pointer operations 

5 • library calls, operating system calls (including standard I/O functions) 

• recursive function calls (Note that non-recursive function calls can be elimi- 
nated by function inlining and therefore are not considered here). 

• All scalar data types are converted to type integer. Integer values are 

1 0 equivalent to data packets in the RDFP. An^ays (possibly multi-dimensional) 
are the only composite data types considered. 

The following additional features are supported: 

1 5 IMPORTS and OUTPORTS can be accessed by the HLL functions get- 
stream(name, value) and putstream(name, value) respectively. 

5.2 Mapping of High-Level Language Constructs 

20 This method converts a HLL program to a control/data-flow graph (CDFG) con- 
sisting of the RDFP functions defined in Section 3.1. Before the processing starts, 
all HLL program an^ys are mapped to RDFP RAM functions. An an^y x is 
mapped to RAM RAM(x). If several anrays are mapped to tiie same RAM, an off- 
set is assigned, too. The RAMs are added to an initially empty CDFG. There must 

25 be enough RAMs of sufficient size for all program anrays. 

The CDFG is generated by a traversal of tiie AST of the HLL program. The fol- 
lowing two pieces of information are maintained at every program point^"^ during 
the traversal: 



^ in a program, program points are between two statements or before the beginning or after the 
end of a program structure like a loop or a conditional statement 
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• START points to an event output of an object It delivers a 0-event whenever 
the program execution at this program point starts. At the beginning, a 0- 
CONSTANT preloaded with an event input is added to the CDFG. (It delivers 
a 0-event immediately after configuration.) START initially points to Its output. 

5 The STOP signal generated after a program part has finished executing is 

used as new START signal for the next program part or signals termination of 
the entire program, 

• VARLIST is a list of {variable, object-output} pairs. The pairs map integer 

10 variables (no arrays) to a CDFG object's output The first pair for a variable in 

VARLIST contains the output of the object which produces the value of this 
variable valid at the program point New pairs are always added to the front of 
VARLIST. The expression VARDEF(var) refers to the object-output of the first 
pair with variable var in VARLIST.^^ 

15 

The following subsections systematically list all HLL components and describe 
how they are processed, thereby altering the CDFG, START and VARLIST. 

5.2.1 Integer Expressions and Assignments 

20 

Straight-line code without array accesses can be directly mapped to a data-flow 
graph. One ALU is allocated for each operator in the program. Because of the 
self-synchronization of the ALUs, no explicit control or scheduling is needed. 
Therefore processing these assignments does not access or alter START. The 
25 data dependencies (as they would be exposed in the DAG representation of the 
program [1] are analyzed through the processing of VARLIST. These assign- 
ments synchronize themselves through the data-flow. The data-driven execution 
automatically exploits the available instruction level parallelism. 

30 All assignments evaluate the right-hand side (RHS) or source expression. This 
evaluation results in a pointer to a CDFG object's output (or pseudo-object as de- 



This method of using a VARLIST is adapted from the Transmogrifier c compiler [3]. 
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fined below). For integer assignnnents, the left-hand side (LHS) variable or desti- 
nation is combined with the RHS result object to form a new pair {LHS, result 
(RHS)} which is added to the front of VARLIST. 

5 The simplest statement is a constant assigned to an integer 

a = 5; 

It doesn't change the CDFG, but adds {a, 5} to the firont of VARLIST, The con- 
10" stant 5 is a "pseudo-object" which only holds the value, but does not refer to a 
CDFG object. Now VARDEF(a) equals 5 at subsequent program points before a 
is redefined. 

Integer assignments can also combine variables already defined and constants: 

15 

b = a * 2 -I- 3; 

In the AST, the RHS is already converted to an expression tree. This tree is 
transformed to a combination of old and new CDFG objects (which are added to 

20 the CDFG) as follows: Each operator (internal node) of the tree is substituted by 
an ALU with the opcode corresponding to the operator in the tree. If a leaf node is 
a constant, the ALU's input is directly connected to that constant. If a leaf note is 
an integer variable var, it is looked up in VARLIST, i. e. VARDEF(var) is retrieved. 
Then VARDEF(var) (an output of an already existing object in CDFG or a con- 

25 stant) is connected to the ALU's input. The output of the ALU corresponding to the 
root operator in the expression tree is defined as the resu/f of the RHS. Finally, a 
new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above 
are processed, the CDFG with two ALUs in Fig. 2 is created^^. Outputs occurring 
in VARLIST are labeled by Roman numbers. After these two assignments. 



Note that we use C syntax for the following examples. 

Note that the input and output names can be deduced from their position, cf. Fig. 1 Also note 
that the compiler frontend would nonnally have substituted the second assignment by b = 13 
(constant propagation). For the simplicity of this explanation, no frontend optimizations are 
considered in this and the following examplse. 
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VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left side.) Note that all in- 
puts connected to a constant (whether direct from the expression tree or retrieved 
from VARLIST) must be defined as constant Inputs defined as constants have a 
small c next to the input arrow in Fig. 2. 

5 

5.2.2 Conditional Integer Assignments 

For conditional if-then-else statements containing only integer assignments, ob- 
jects for condition evaluation are created first. The object event output indicating 

10 the condition result is kept for choosing the correct branch result later. Next, both 
branches are processed in parallel, using separate copies VARLIST1 and 
VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for ail variables 
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combina- 
tion phase). The valid definitions from VARL1ST1 and VARLIST2 are combined 

15 with a MUX function, and the correct input is selected by the condition result For 
variables only defined in one of the two branches, the multiplexer uses the result 
retrieved from the original VARLIST for the other branch. If the original VARLIST 
does not have an entry for this variable, a special "undefined" constant value is 
used. However, in a functionally correct program this value will never be used. As 

20 an optimization, only variables live [1] after the if-then-else structure need to be 
added to VARLIST in the combination phase. 

Consider the following example: 

25 

i = 7; 
a = 3; 

if (i < 10) { 
a = 5; 
30 c = 7; 
} 

else { 

c = a - 1; 
d = 0; 

35 } 
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Fir. 3 shows the resulting CDFG. Before the if-then-else construct. VARLIST = [{a, 
3}, {i, 7}]. After processing the branches, for the then branch, VARLIST1 = [{c, 7}, 
{a, 5}. {a. 3}, {i, 7}], and for the else branch, VARLIST2 = [{d, 0}. {c, I, {a, 3}, {i, 7}]. 
After combination, VARLIST = [{d. II}. {c, 111}, {a, IV}. {a, 3}. {i. 7}], 

5 

Note that case-or switch-statements can be processed, too, since they can - with- 
out loss of generality - be converted to nested if-then-else statements. 

This processing of conditional statements doesn't need explicit control, either. 
10 Both branches are executed in parallel and synchronized by the data-flow. 

5.2.3 Array Accesses 

In contrast to the above sections, array accesses have to be controlled explicitly 
15 to maintain the correct execution order. For a read access the read address is 
connected to data input RD. For a write access, the write address is connected to 
data input WR and the write value to input IN. All these inputs are connected to 
their respective sources through a GATE controlled by START. A STOP event 
signaling completion of the array access must be assigned to the START varl- 
20 able. Since there's only one START event packet available, only one array access 
can occur at a time, and the execution order of the original program is maintained. 
This scheduling scheme is similar to a one-hot controller for digital hardware. 

If a RAM is read and written at only one program point, the ERD or EWR outputs 
25 can be used as STOP events. However, if several read or several write accesses 
(from different program points) to the same RAM occur, each access produces a 
ERD or EWR event, respectively. But a STOP event should only be executed for 
the program point currently executed, the current access. This is achieved by 
connecting the START signals (i. e. those connected to the GATEs) of all other 
30 accesses with the inverted START signal of the current access. The resulting sig- 
nal produces an event for every access, but only for the current access a 1 -event. 
This event is combined (ECOMB) with the RAM's ERD or EWR access. The 
ECOMB's output will only occur after the access is completed. Because ECOMB 
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OR-combines its event packets, only the current access produces a 1 -event. 
Next, this event is filtered with a 1 -FILTER and changed by a 0-CONSTANT, re- 
sulting in a STOP signal which produces a 0-event only after the current access is 
connpleted as required. See below for an example. 

5 

For computing the RAM addresses, the compiler frontend's standard transforma- 
tion for array accesses can be used. The only difference is that the offset with 
respect to the RDFP RAM (as determined in the initial mapping phase) must be 
used. 

10' 

For several accesses, several sources can be connected to the RD, WR and IN 
inputs of a RAM. This disables the self-synchronization. However, since only one 
access at a time can happen, the GATEs only allow one data packet to arrive at 
the inputs. 

15 

For read accesses, the packets at the OUT output face the same problem as the 
ERD event packets: They occur for every read access, but must only be used 
(and forwarded to subsequent operators) for the current access. This can be 
achieved by connecting the OUT output via a DEMUX function. The Y output of 

20 the DEMUX is used, and the X output is left unconnected. The it acts as a selec- 
tive gate which only forwards packets if its SEL input receives a 1 -event, and dis- 
cards its data input if SEL receives a 0-event. The signal created by the ECOMB 
described above for the STOP signal creates a 1 -event for the current access, 
and a 0-event otherwise. Using it as the SEL input achieves exactly the desired 

25 functionality. 

To avoid redundant read accesses, RAM reads are also registered in VARLIST. 
Instead of an integer variable, an anray element is used as first element of the 
pair. However, a change in a variable occurring in an array index invalidates the 
30 information in VARLIST. It must then be removed from it. 

The following example shows two read accesses: 
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X = a [i] ; 
y = a[j] ; 
2 = X + y; 

5 Fig. 4 shows the resulting CDFG. Inputs START (old), i and j should be substi- 
tuted by the actual functions resulting from the program before the array reads. 
The signal indicating the STOP of the first access is marked by STOP1 . Write ac- 
cesses use the same control events, but instead of one GATE per access for the 
RD inputs, one GATE for WR and one gate for IN (with the same E input) are 
10 used. Also no outputs need to be handled. 

Fig. 5 shows the access a [I] = x; for the simple case that the RAM is only written 
once, i. e. at one program point. 

15 This scheme executes RAM accesses correctly, but not very fast since all ac- 
cesses are synchronized even if this is not necessary. The following optimizations 
are possible: 

• Only accesses to the same RAM are synchronized. Accesses to different 
20 arrays can occur concurrently or even in changed order. When there is a 

data dependency, the accesses self-synchronize automatically. This can be 
achieved by maintaining a separate START signal for every RAM. At the 
end of a basic block [1], all these START signals must be combined by a 
ECONB to provide a new signal for the next basic block. 

25 

• For sequences of either read accesses or write accesses (not mixed) within 
a basic block, it is possible to stream data into the RAM rather than waiting 
for the previous access to complete. For this purpose, a combination of 
MERGE functions selects the RD or WR and IN inputs in the order dictated 

30 by the sequence. The MERGES must be controlled by iterative ESEQs 

guaranteeing that the inputs are only forwarded in this order. Then only the 
first access in the sequence needs to be controlled by GATEs, the other 
GATES can be removed to increase throughput Similarly, the OUT outputs 
of a read access can be distributed more efficiently for a sequence, A com- 
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bination of DEMUX functions with the same ESEQ control can be used. For 
read accesses, the generation of the last output can be sent through a 
GATE (without the E input connected), thereby producing a STOP event. 

5 Fig. 6 shows the following three array reads in the optinnized fashion. 

X = a [i] ; 
y = a[j] ; 
z = a [k] ; 

10 

5.2.4 Input and Output Ports 

Input and output ports are processed sinnilar to vector accesses. A read from an 
1 5 input port is like an an^ay read without an address. The input data packet is sent 
to DEMUX functions which send it to the correct subsequent operators. The 
STOP signal is generated in the same way as described above for RAM accesses 
by combining the INPORTs U output with the current and other START signals. 

20 Output ports control the data packets by GATEs like array write accesses. The 
STOP signal is also created as for RAM accesses. 

5.2.5 General Conditional Statements 

25 Conditional statements containing either array accesses or inner loops cannot be 
processed as described in Section 5.2.2. Data packets must only be sent to the 
active branch. Therefore, a dataflow analysis is required. Used sets and defined 
sets [1] of both branches must be computed. For all variables in either of these 
sets DEMUX functions controlled by the IF condition are inserted. They route data 

30 packets only to the selected branch. New lists VARLIST1 and VARLIST2 are 

compiled with the respective outputs of these DEMUX functions. The then-branch 
is processed with VARLIST1, and the else branch with VARLIST2. Finally, the 
output values are combined. Since only one branch is ever activated there will not 
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be a conflict due to two packets arriving simultaneously. The combinations will be 
added to VARLIST after the conditional statement. 

5.2.6 FOR Loops 

5 

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound 
(UB), and increment (INC) expressions are evaluated like any expressions (see 
Sections 5.2.1 and 5.2.3) and connected to the respective inputs. The START 
input is connected to the START signal. The new START signal (after loop exe- 

1 0 cution) is CNT's W output sent through a 1 -FILTER and 0-CONSTANT. (W only 
outputs a 1-event after the counter has temninated.) CNTs V output produces one 
0-event for each loop iteration and is therefore used as START for the loop body. 
Finally, CNTs NEXT Input is connected to the START signal at the end of the 
loop body (I. e. ite STOP signal.) This assures that one iteration only starts after 

1 5 the pervious one has finished. CNTs X output provides the current value of the 
loop index variable. For FOR loops, dataflow analysis is required, too. 

For all variables defined in the loop body and live at its beginning, a combination 
of the input value (from VARLIST at loop entry) and a feedback value from the 

20 end of the loop is created. Next, each one of these signals is connected to a DE- 
MUX which is controlled by CNT's W output. It sends the input or feedback values 
back to the loop body (0-event) during loop execution. The VARLIST used in the 
loop body contains these DEMUX outputs. After loop termination, the input of 
feedback values are sent to the output of the loop (1-event). The varlist at the end 

25 of the loop contains these DEMUX outputs. Inpute not defined in the loop are 
taken from the input VARLIST. 

The processing of the loop body requires some special consideration. Data pack- 
ets from variables defined outside the loop but only used inside the loop (not re- 
30 defined) do not lead to the creation of a feedback signal as explained above. 

Therefore only one packet is available (unless it is a constant), but it is consumed 
in each loop operation. This would stall the loop operation from the second itera- 
tion onwards. Thus it is necessary to multiplicate the packet for each loop opera- 
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tion This is achieved by a MDATA function with the SEL input connected to CNT's 
U output. 

These methods allow to process arbitrarily nested loops and conditional state- 
5 ments. 

Fig. 7 shows the generated CDFG for the following for loop. 

a = b + c; 
ia- for (i=0; i<=10; i++) { 
a = a + i; 
x[i] = k; 

} 

1 5 Note that only one data packet arrives for variables b, c and k, and one final 
packet is produced for a (out). No GATEs are inserted for the RAM write ac- 
cesses since the packet generation is controlled by the counter anyway. 

5.2.7 WHILE Loops 

20 

WHILE loops are processed similarly. The STOP signal (new START signal) Is 
generated from the loop condition, fed through a 0-FILTER. When the loop fin- 
ishes, an additional signal (similar to the CNTs W output) must be generated 
which controls the DEMUXes to generate an output. 

25 

5.2.8 Parallelization, Vectorization and Pipelining 

The method described so far generates CDFGs performing the HLL program's 
functionality on an RDFP. However, the program execution is unnecessarily se- 
30 quentialized by the START signals. In many cases this is too restrictive. Several 
optimizations are possible. 

Independent loops (operating on different variables and arrays) need not be se- 
quentialized. They can use the same START signal, and operate independently. 
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After execution, their STOP signals must be combined by ECOMB, fomriing a new 
START signal for the subsequent program parts. 

In some cases, loops can be vectorized. This means that loop iterations can 
5 overlap, leading to a pipelined data-flow through the operators of the loop body 
[4]. This technique can be easily applied to the method described here. For FOR 
loops, the CNT's NEXT input is removed so that CNT counts continuously, 
thereby overiapping the loop iterations. Since vectorizable loops have no memory 
access conflicts, the read and write accesses to the same RAM can also overiap. 
1 0 Especially for dual-pori:ed RAM this leads to considerable perfomnance improve- 
ments. In this case separate START signals must not only be maintained for each 
RAM, but also separately for read and write accesses. 

Finally, loop transfomiations like loop unrolling, loop distribution, loop tiling or loop 
1 5 merging [4] can be applied to increase the parallelism and improve perfomnance. 
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